* [PATCH 00/12] iommu/vt-d: Optimize the use of locks
@ 2022-05-27  6:30 Lu Baolu
  2022-05-27  6:30 ` [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs Lu Baolu
                   ` (11 more replies)
  0 siblings, 12 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

Hi folks,

This series tries to optimize the use of two locks in the Intel IOMMU
driver:

- The intel_iommu::lock is used to protect the IOMMU resources shared by
  devices. They include the IOMMU root and context tables, the pasid
  tables and the domain IDs.
- The global device_domain_lock is used to protect the global and the
  per-domain device tracking lists.

The optimizations include:

- Removing the unnecessary global device tracking list;
- Removing unnecessary locking;
- Reducing the scope of each lock as much as possible, that is, taking
  the lock only where necessary (a minimal sketch of this pattern
  follows the list);
- Transforming the global lock into a local lock to improve
  efficiency;
- Converting a spinlock into a mutex so that non-critical functions
  can be called while the lock is held.
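
As an aside, the scope-narrowing idea boils down to the pattern
sketched below. This is not code from the patches; all demo_* names
are made up for illustration.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical types, for illustration only. */
struct demo_domain {
	spinlock_t lock;		/* protects @devices */
	struct list_head devices;
};

struct demo_dev_info {
	struct list_head link;
};

/*
 * Everything that does not touch the shared list runs unlocked;
 * the lock covers only the list manipulation itself.
 */
static void demo_attach(struct demo_domain *domain,
			struct demo_dev_info *info)
{
	/* ...non-critical setup runs without the lock held... */
	spin_lock(&domain->lock);
	list_add(&info->link, &domain->devices);
	spin_unlock(&domain->lock);
}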

This series is also available on github:
https://github.com/LuBaolu/intel-iommu/commits/intel-iommu-lock-optimization-v1

Your comments and suggestions are much appreciated.

Best regards,
baolu

Lu Baolu (12):
  iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  iommu/vt-d: Remove for_each_device_domain()
  iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()
  iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()
  iommu/vt-d: Unnecessary spinlock for root table alloc and free
  iommu/vt-d: Acquiring lock in domain ID allocation helpers
  iommu/vt-d: Acquiring lock in pasid manipulation helpers
  iommu/vt-d: Replace spin_lock_irqsave() with spin_lock()
  iommu/vt-d: Check device list of domain in domain free path
  iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller
  iommu/vt-d: Use device_domain_lock accurately
  iommu/vt-d: Convert device_domain_lock into per-domain mutex

 drivers/iommu/intel/iommu.h   |   5 +-
 drivers/iommu/intel/debugfs.c |  26 ++--
 drivers/iommu/intel/iommu.c   | 271 +++++++++-------------------------
 drivers/iommu/intel/pasid.c   | 143 +++++++++---------
 drivers/iommu/intel/svm.c     |   5 +-
 5 files changed, 147 insertions(+), 303 deletions(-)

-- 
2.25.1



* [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27 14:59   ` Jason Gunthorpe
  2022-05-27  6:30 ` [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain() Lu Baolu
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

Retrieve the attached domain for a device through the generic interface
exposed by the iommu core. This also makes device_domain_lock static.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.h   |  1 -
 drivers/iommu/intel/debugfs.c | 20 ++++++++------------
 drivers/iommu/intel/iommu.c   |  2 +-
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a22adfbdf870..8a6d64d726c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -480,7 +480,6 @@ enum {
 #define VTD_FLAG_SVM_CAPABLE		(1 << 2)
 
 extern int intel_iommu_sm;
-extern spinlock_t device_domain_lock;
 
 #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..eea8727aa7bc 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -344,19 +344,21 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
 
 static int show_device_domain_translation(struct device *dev, void *data)
 {
-	struct device_domain_info *info = dev_iommu_priv_get(dev);
-	struct dmar_domain *domain = info->domain;
+	struct dmar_domain *dmar_domain;
+	struct iommu_domain *domain;
 	struct seq_file *m = data;
 	u64 path[6] = { 0 };
 
+	domain = iommu_get_domain_for_dev(dev);
 	if (!domain)
 		return 0;
 
+	dmar_domain = to_dmar_domain(domain);
 	seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
-		   (u64)virt_to_phys(domain->pgd));
+		   (u64)virt_to_phys(dmar_domain->pgd));
 	seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
 
-	pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
+	pgtable_walk_level(m, dmar_domain->pgd, dmar_domain->agaw + 2, 0, path);
 	seq_putc(m, '\n');
 
 	return 0;
@@ -364,15 +366,9 @@ static int show_device_domain_translation(struct device *dev, void *data)
 
 static int domain_translation_struct_show(struct seq_file *m, void *unused)
 {
-	unsigned long flags;
-	int ret;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
-	ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-			       show_device_domain_translation);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-
-	return ret;
+	return bus_for_each_dev(&pci_bus_type, NULL, m,
+				show_device_domain_translation);
 }
 DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);
 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1af4b6562266..cacae8bdaa65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
 #define IDENTMAP_GFX		2
 #define IDENTMAP_AZALIA		4
 
-DEFINE_SPINLOCK(device_domain_lock);
+static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);
 
 /*
-- 
2.25.1



* [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain()
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
  2022-05-27  6:30 ` [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27 15:00   ` Jason Gunthorpe
  2022-06-01  8:53   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu() Lu Baolu
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The per-device device_domain_info data can be retrieved from the
device itself. There's no need to search a global list.
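
For context, the replacement lookup is essentially the pattern below.
This is a sketch rather than the patch itself; demo_get_info() is a
made-up name.

#include <linux/iommu.h>

/*
 * dev_iommu_priv_get() returns the pointer that the driver stashed
 * with dev_iommu_priv_set() at probe time -- a direct per-device
 * lookup with no list walk and no lock required.
 */
static struct device_domain_info *demo_get_info(struct device *dev)
{
	return dev_iommu_priv_get(dev);
}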

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.h |  2 --
 drivers/iommu/intel/iommu.c | 25 -------------------------
 drivers/iommu/intel/pasid.c | 37 +++++++++++--------------------------
 3 files changed, 11 insertions(+), 53 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 8a6d64d726c0..2f4a5b9509c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -727,8 +727,6 @@ extern int dmar_ir_support(void);
 void *alloc_pgtable_page(int node);
 void free_pgtable_page(void *vaddr);
 struct intel_iommu *domain_get_iommu(struct dmar_domain *domain);
-int for_each_device_domain(int (*fn)(struct device_domain_info *info,
-				     void *data), void *data);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index cacae8bdaa65..6549b09d7f32 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -316,31 +316,6 @@ static int iommu_skip_te_disable;
 
 static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);
-
-/*
- * Iterate over elements in device_domain_list and call the specified
- * callback @fn against each element.
- */
-int for_each_device_domain(int (*fn)(struct device_domain_info *info,
-				     void *data), void *data)
-{
-	int ret = 0;
-	unsigned long flags;
-	struct device_domain_info *info;
-
-	spin_lock_irqsave(&device_domain_lock, flags);
-	list_for_each_entry(info, &device_domain_list, global) {
-		ret = fn(info, data);
-		if (ret) {
-			spin_unlock_irqrestore(&device_domain_lock, flags);
-			return ret;
-		}
-	}
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-
-	return 0;
-}
-
 const struct iommu_ops intel_iommu_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index b2ac5869286f..0627d6465f25 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -103,36 +103,20 @@ device_detach_pasid_table(struct device_domain_info *info,
 }
 
 struct pasid_table_opaque {
-	struct pasid_table	**pasid_table;
-	int			segment;
-	int			bus;
-	int			devfn;
+	struct pasid_table	*pasid_table;
 };
 
-static int search_pasid_table(struct device_domain_info *info, void *opaque)
-{
-	struct pasid_table_opaque *data = opaque;
-
-	if (info->iommu->segment == data->segment &&
-	    info->bus == data->bus &&
-	    info->devfn == data->devfn &&
-	    info->pasid_table) {
-		*data->pasid_table = info->pasid_table;
-		return 1;
-	}
-
-	return 0;
-}
-
 static int get_alias_pasid_table(struct pci_dev *pdev, u16 alias, void *opaque)
 {
 	struct pasid_table_opaque *data = opaque;
+	struct device_domain_info *info;
 
-	data->segment = pci_domain_nr(pdev->bus);
-	data->bus = PCI_BUS_NUM(alias);
-	data->devfn = alias & 0xff;
+	info = dev_iommu_priv_get(&pdev->dev);
+	if (!info || !info->pasid_table)
+		return 0;
 
-	return for_each_device_domain(&search_pasid_table, data);
+	data->pasid_table = info->pasid_table;
+	return 1;
 }
 
 /*
@@ -141,9 +125,9 @@ static int get_alias_pasid_table(struct pci_dev *pdev, u16 alias, void *opaque)
  */
 int intel_pasid_alloc_table(struct device *dev)
 {
+	struct pasid_table_opaque data = { NULL };
 	struct device_domain_info *info;
 	struct pasid_table *pasid_table;
-	struct pasid_table_opaque data;
 	struct page *pages;
 	u32 max_pasid = 0;
 	int ret, order;
@@ -155,11 +139,12 @@ int intel_pasid_alloc_table(struct device *dev)
 		return -EINVAL;
 
 	/* DMA alias device already has a pasid table, use it: */
-	data.pasid_table = &pasid_table;
 	ret = pci_for_each_dma_alias(to_pci_dev(dev),
 				     &get_alias_pasid_table, &data);
-	if (ret)
+	if (ret) {
+		pasid_table = data.pasid_table;
 		goto attach_out;
+	}
 
 	pasid_table = kzalloc(sizeof(*pasid_table), GFP_KERNEL);
 	if (!pasid_table)
-- 
2.25.1



* [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
  2022-05-27  6:30 ` [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs Lu Baolu
  2022-05-27  6:30 ` [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain() Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27 15:01   ` Jason Gunthorpe
  2022-05-27  6:30 ` [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk() Lu Baolu
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

disable_dmar_iommu() is called when IOMMU initialization fails or when
the IOMMU is hot-removed from the system. In both cases, there is no
need to clear the IOMMU translation data structures for devices.

On the initialization path, device probing only happens after the
IOMMU is initialized successfully, hence there are no translation data
structures yet.

On the hot-remove path, there is no real use case where the IOMMU is
hot-removed but the devices it manages are still alive in the system.
The translation data structures were torn down during device release,
hence there's no need to repeat the teardown on the IOMMU hot-remove
path either.

So, let's remove this unnecessary code.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 6549b09d7f32..25d4c5200526 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1715,24 +1715,9 @@ static int iommu_init_domains(struct intel_iommu *iommu)
 
 static void disable_dmar_iommu(struct intel_iommu *iommu)
 {
-	struct device_domain_info *info, *tmp;
-	unsigned long flags;
-
 	if (!iommu->domain_ids)
 		return;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
-	list_for_each_entry_safe(info, tmp, &device_domain_list, global) {
-		if (info->iommu != iommu)
-			continue;
-
-		if (!info->dev || !info->domain)
-			continue;
-
-		__dmar_remove_one_dev_info(info);
-	}
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-
 	if (iommu->gcmd & DMA_GCMD_TE)
 		iommu_disable_translation(iommu);
 }
-- 
2.25.1



* [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (2 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu() Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27 15:01   ` Jason Gunthorpe
  2022-06-01  8:56   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 05/12] iommu/vt-d: Unnecessary spinlock for root table alloc and free Lu Baolu
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

Use pci_get_domain_bus_and_slot() instead of searching the global list
to retrieve the PCI device pointer. This allows the device_domain_list
global list to be removed, as it no longer has any consumers.
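
Worth noting: pci_get_domain_bus_and_slot() returns a reference-counted
struct pci_dev, so a caller normally drops the reference with
pci_dev_put() when done. A minimal usage sketch (demo_lookup() is a
made-up name, error handling elided):

#include <linux/pci.h>

static void demo_lookup(u16 segment, u8 bus, u8 devfn)
{
	struct pci_dev *pdev;

	pdev = pci_get_domain_bus_and_slot(segment, bus, devfn);
	if (!pdev)
		return;

	/* ...use pdev, e.g. dev_iommu_priv_get(&pdev->dev)... */

	pci_dev_put(pdev);	/* drop the reference taken by the lookup */
}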

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.h |  1 -
 drivers/iommu/intel/iommu.c | 33 ++++++---------------------------
 2 files changed, 6 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2f4a5b9509c0..6724703d573b 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -609,7 +609,6 @@ struct intel_iommu {
 /* PCI domain-device relationship */
 struct device_domain_info {
 	struct list_head link;	/* link to domain siblings */
-	struct list_head global; /* link to global list */
 	struct list_head table;	/* link to pasid table */
 	u32 segment;		/* PCI segment number */
 	u8 bus;			/* PCI bus number */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 25d4c5200526..bbdd3417a1b1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -131,8 +131,6 @@ static struct intel_iommu **g_iommus;
 
 static void __init check_tylersburg_isoch(void);
 static int rwbf_quirk;
-static inline struct device_domain_info *
-dmar_search_domain_by_dev_info(int segment, int bus, int devfn);
 
 /*
  * set to 1 to panic kernel if can't successfully enable VT-d
@@ -315,7 +313,6 @@ static int iommu_skip_te_disable;
 #define IDENTMAP_AZALIA		4
 
 static DEFINE_SPINLOCK(device_domain_lock);
-static LIST_HEAD(device_domain_list);
 const struct iommu_ops intel_iommu_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
@@ -845,9 +842,14 @@ static void pgtable_walk(struct intel_iommu *iommu, unsigned long pfn, u8 bus, u
 	struct device_domain_info *info;
 	struct dma_pte *parent, *pte;
 	struct dmar_domain *domain;
+	struct pci_dev *pdev;
 	int offset, level;
 
-	info = dmar_search_domain_by_dev_info(iommu->segment, bus, devfn);
+	pdev = pci_get_domain_bus_and_slot(iommu->segment, bus, devfn);
+	if (!pdev)
+		return;
+
+	info = dev_iommu_priv_get(&pdev->dev);
 	if (!info || !info->domain) {
 		pr_info("device [%02x:%02x.%d] not probed\n",
 			bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
@@ -2348,19 +2350,6 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
 
-static inline struct device_domain_info *
-dmar_search_domain_by_dev_info(int segment, int bus, int devfn)
-{
-	struct device_domain_info *info;
-
-	list_for_each_entry(info, &device_domain_list, global)
-		if (info->segment == segment && info->bus == bus &&
-		    info->devfn == devfn)
-			return info;
-
-	return NULL;
-}
-
 static int domain_setup_first_level(struct intel_iommu *iommu,
 				    struct dmar_domain *domain,
 				    struct device *dev,
@@ -4564,7 +4553,6 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 	struct pci_dev *pdev = dev_is_pci(dev) ? to_pci_dev(dev) : NULL;
 	struct device_domain_info *info;
 	struct intel_iommu *iommu;
-	unsigned long flags;
 	u8 bus, devfn;
 
 	iommu = device_to_iommu(dev, &bus, &devfn);
@@ -4607,10 +4595,7 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 		}
 	}
 
-	spin_lock_irqsave(&device_domain_lock, flags);
-	list_add(&info->global, &device_domain_list);
 	dev_iommu_priv_set(dev, info);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	return &iommu->iommu;
 }
@@ -4618,15 +4603,9 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 static void intel_iommu_release_device(struct device *dev)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
-	unsigned long flags;
 
 	dmar_remove_one_dev_info(dev);
-
-	spin_lock_irqsave(&device_domain_lock, flags);
 	dev_iommu_priv_set(dev, NULL);
-	list_del(&info->global);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-
 	kfree(info);
 	set_dma_ops(dev, NULL);
 }
-- 
2.25.1



* [PATCH 05/12] iommu/vt-d: Unnecessary spinlock for root table alloc and free
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (3 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk() Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-06-01  9:05   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers Lu Baolu
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The IOMMU root table is allocated and freed in the IOMMU initialization
code, on the static boot or hot-plug paths. There's no need for a
spinlock there.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index bbdd3417a1b1..2d5f02b85de8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -809,14 +809,12 @@ static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
 
 static void free_context_table(struct intel_iommu *iommu)
 {
-	int i;
-	unsigned long flags;
 	struct context_entry *context;
+	int i;
+
+	if (!iommu->root_entry)
+		return;
 
-	spin_lock_irqsave(&iommu->lock, flags);
-	if (!iommu->root_entry) {
-		goto out;
-	}
 	for (i = 0; i < ROOT_ENTRY_NR; i++) {
 		context = iommu_context_addr(iommu, i, 0, 0);
 		if (context)
@@ -828,12 +826,10 @@ static void free_context_table(struct intel_iommu *iommu)
 		context = iommu_context_addr(iommu, i, 0x80, 0);
 		if (context)
 			free_pgtable_page(context);
-
 	}
+
 	free_pgtable_page(iommu->root_entry);
 	iommu->root_entry = NULL;
-out:
-	spin_unlock_irqrestore(&iommu->lock, flags);
 }
 
 #ifdef CONFIG_DMAR_DEBUG
@@ -1232,7 +1228,6 @@ static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,
 static int iommu_alloc_root_entry(struct intel_iommu *iommu)
 {
 	struct root_entry *root;
-	unsigned long flags;
 
 	root = (struct root_entry *)alloc_pgtable_page(iommu->node);
 	if (!root) {
@@ -1242,10 +1237,7 @@ static int iommu_alloc_root_entry(struct intel_iommu *iommu)
 	}
 
 	__iommu_flush_cache(iommu, root, ROOT_SIZE);
-
-	spin_lock_irqsave(&iommu->lock, flags);
 	iommu->root_entry = root;
-	spin_unlock_irqrestore(&iommu->lock, flags);
 
 	return 0;
 }
-- 
2.25.1



* [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (4 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 05/12] iommu/vt-d: Unnecessary spinlock for root table alloc and free Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-06-01  9:09   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers Lu Baolu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The iommu->lock is used to protect the per-IOMMU domain ID resource.
Move the spinlock acquisition/release into the helpers where domain
IDs are allocated and freed. The device_domain_lock is irrelevant to
the domain ID resource, so remove its assertion as well.
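
The resulting shape is the classic "the helper owns the lock" pattern.
A minimal sketch with made-up demo_* names (not the driver's actual
structures):

#include <linux/bitops.h>
#include <linux/spinlock.h>

struct demo_iommu {
	spinlock_t lock;		/* protects @domain_ids */
	unsigned long *domain_ids;
	unsigned int ndomains;
};

/* Callers need no locking knowledge; the helper takes the lock. */
static int demo_alloc_domain_id(struct demo_iommu *iommu)
{
	int id;

	spin_lock(&iommu->lock);
	id = find_first_zero_bit(iommu->domain_ids, iommu->ndomains);
	if (id < iommu->ndomains)
		set_bit(id, iommu->domain_ids);
	else
		id = -ENOSPC;
	spin_unlock(&iommu->lock);

	return id;
}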

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2d5f02b85de8..0da937ce0534 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1774,16 +1774,13 @@ static struct dmar_domain *alloc_domain(unsigned int type)
 	return domain;
 }
 
-/* Must be called with iommu->lock */
 static int domain_attach_iommu(struct dmar_domain *domain,
 			       struct intel_iommu *iommu)
 {
 	unsigned long ndomains;
-	int num;
-
-	assert_spin_locked(&device_domain_lock);
-	assert_spin_locked(&iommu->lock);
+	int num, ret = 0;
 
+	spin_lock(&iommu->lock);
 	domain->iommu_refcnt[iommu->seq_id] += 1;
 	if (domain->iommu_refcnt[iommu->seq_id] == 1) {
 		ndomains = cap_ndoms(iommu->cap);
@@ -1792,7 +1789,8 @@ static int domain_attach_iommu(struct dmar_domain *domain,
 		if (num >= ndomains) {
 			pr_err("%s: No free domain ids\n", iommu->name);
 			domain->iommu_refcnt[iommu->seq_id] -= 1;
-			return -ENOSPC;
+			ret = -ENOSPC;
+			goto out_unlock;
 		}
 
 		set_bit(num, iommu->domain_ids);
@@ -1801,7 +1799,9 @@ static int domain_attach_iommu(struct dmar_domain *domain,
 		domain_update_iommu_cap(domain);
 	}
 
-	return 0;
+out_unlock:
+	spin_unlock(&iommu->lock);
+	return ret;
 }
 
 static void domain_detach_iommu(struct dmar_domain *domain,
@@ -1809,9 +1809,7 @@ static void domain_detach_iommu(struct dmar_domain *domain,
 {
 	int num;
 
-	assert_spin_locked(&device_domain_lock);
-	assert_spin_locked(&iommu->lock);
-
+	spin_lock(&iommu->lock);
 	domain->iommu_refcnt[iommu->seq_id] -= 1;
 	if (domain->iommu_refcnt[iommu->seq_id] == 0) {
 		num = domain->iommu_did[iommu->seq_id];
@@ -1819,6 +1817,7 @@ static void domain_detach_iommu(struct dmar_domain *domain,
 		domain_update_iommu_cap(domain);
 		domain->iommu_did[iommu->seq_id] = 0;
 	}
+	spin_unlock(&iommu->lock);
 }
 
 static inline int guestwidth_to_adjustwidth(int gaw)
@@ -2471,9 +2470,7 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 
 	spin_lock_irqsave(&device_domain_lock, flags);
 	info->domain = domain;
-	spin_lock(&iommu->lock);
 	ret = domain_attach_iommu(domain, iommu);
-	spin_unlock(&iommu->lock);
 	if (ret) {
 		spin_unlock_irqrestore(&device_domain_lock, flags);
 		return ret;
@@ -4158,7 +4155,6 @@ static void __dmar_remove_one_dev_info(struct device_domain_info *info)
 {
 	struct dmar_domain *domain;
 	struct intel_iommu *iommu;
-	unsigned long flags;
 
 	assert_spin_locked(&device_domain_lock);
 
@@ -4179,10 +4175,7 @@ static void __dmar_remove_one_dev_info(struct device_domain_info *info)
 	}
 
 	list_del(&info->link);
-
-	spin_lock_irqsave(&iommu->lock, flags);
 	domain_detach_iommu(domain, iommu);
-	spin_unlock_irqrestore(&iommu->lock, flags);
 }
 
 static void dmar_remove_one_dev_info(struct device *dev)
-- 
2.25.1



* [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (5 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-06-01  9:18   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 08/12] iommu/vt-d: Replace spin_lock_irqsave() with spin_lock() Lu Baolu
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The iommu->lock is used to protect the per-IOMMU pasid directory table
and pasid tables. Move the spinlock acquisition/release into the
helpers to make the code self-contained.
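
The resulting shape of the setup helpers looks roughly like the sketch
below, reusing the driver's existing intel_pasid_get_entry() and
pasid_pte_is_present() helpers; demo_setup_entry() itself is a made-up
name.

/*
 * Sketch: the helper takes iommu->lock itself, looks up the entry,
 * and bails out under the lock if the entry is already present.
 */
static int demo_setup_entry(struct intel_iommu *iommu,
			    struct device *dev, u32 pasid)
{
	struct pasid_entry *pte;

	spin_lock(&iommu->lock);
	pte = intel_pasid_get_entry(dev, pasid);
	if (!pte || pasid_pte_is_present(pte)) {
		spin_unlock(&iommu->lock);
		return -ENODEV;
	}
	/* ...fill in the entry fields while still holding the lock... */
	spin_unlock(&iommu->lock);

	return 0;
}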

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c |   2 -
 drivers/iommu/intel/pasid.c | 106 +++++++++++++++++++-----------------
 drivers/iommu/intel/svm.c   |   5 +-
 3 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0da937ce0534..ccf3c7fa26f1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2488,7 +2488,6 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 		}
 
 		/* Setup the PASID entry for requests without PASID: */
-		spin_lock_irqsave(&iommu->lock, flags);
 		if (hw_pass_through && domain_type_is_si(domain))
 			ret = intel_pasid_setup_pass_through(iommu, domain,
 					dev, PASID_RID2PASID);
@@ -2498,7 +2497,6 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 		else
 			ret = intel_pasid_setup_second_level(iommu, domain,
 					dev, PASID_RID2PASID);
-		spin_unlock_irqrestore(&iommu->lock, flags);
 		if (ret) {
 			dev_err(dev, "Setup RID2PASID failed\n");
 			dmar_remove_one_dev_info(dev);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0627d6465f25..bab5c385fa1e 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -498,17 +498,17 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
 	struct pasid_entry *pte;
 	u16 did, pgtt;
 
+	spin_lock(&iommu->lock);
 	pte = intel_pasid_get_entry(dev, pasid);
-	if (WARN_ON(!pte))
-		return;
-
-	if (!pasid_pte_is_present(pte))
+	if (WARN_ON(!pte) || !pasid_pte_is_present(pte)) {
+		spin_unlock(&iommu->lock);
 		return;
+	}
 
 	did = pasid_get_domain_id(pte);
 	pgtt = pasid_pte_get_pgtt(pte);
-
 	intel_pasid_clear_entry(dev, pasid, fault_ignore);
+	spin_unlock(&iommu->lock);
 
 	if (!ecap_coherent(iommu->ecap))
 		clflush_cache_range(pte, sizeof(*pte));
@@ -544,21 +544,17 @@ static void pasid_flush_caches(struct intel_iommu *iommu,
 	}
 }
 
-static inline int pasid_enable_wpe(struct pasid_entry *pte)
+static struct pasid_entry *get_non_present_pasid_entry(struct device *dev,
+						       u32 pasid)
 {
-#ifdef CONFIG_X86
-	unsigned long cr0 = read_cr0();
+	struct pasid_entry *pte;
 
-	/* CR0.WP is normally set but just to be sure */
-	if (unlikely(!(cr0 & X86_CR0_WP))) {
-		pr_err_ratelimited("No CPU write protect!\n");
-		return -EINVAL;
-	}
-#endif
-	pasid_set_wpe(pte);
+	pte = intel_pasid_get_entry(dev, pasid);
+	if (!pte || pasid_pte_is_present(pte))
+		return NULL;
 
-	return 0;
-};
+	return pte;
+}
 
 /*
  * Set up the scalable mode pasid table entry for first only
@@ -576,39 +572,47 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 		return -EINVAL;
 	}
 
-	pte = intel_pasid_get_entry(dev, pasid);
-	if (WARN_ON(!pte))
+	if ((flags & PASID_FLAG_SUPERVISOR_MODE)) {
+#ifdef CONFIG_X86
+		unsigned long cr0 = read_cr0();
+
+		/* CR0.WP is normally set but just to be sure */
+		if (unlikely(!(cr0 & X86_CR0_WP))) {
+			pr_err("No CPU write protect!\n");
+			return -EINVAL;
+		}
+#endif
+		if (!ecap_srs(iommu->ecap)) {
+			pr_err("No supervisor request support on %s\n",
+			       iommu->name);
+			return -EINVAL;
+		}
+	}
+
+	if ((flags & PASID_FLAG_FL5LP) && !cap_5lp_support(iommu->cap)) {
+		pr_err("No 5-level paging support for first-level on %s\n",
+		       iommu->name);
 		return -EINVAL;
+	}
 
-	/* Caller must ensure PASID entry is not in use. */
-	if (pasid_pte_is_present(pte))
-		return -EBUSY;
+	spin_lock(&iommu->lock);
+	pte = get_non_present_pasid_entry(dev, pasid);
+	if (!pte) {
+		spin_unlock(&iommu->lock);
+		return -ENODEV;
+	}
 
 	pasid_clear_entry(pte);
 
 	/* Setup the first level page table pointer: */
 	pasid_set_flptr(pte, (u64)__pa(pgd));
 	if (flags & PASID_FLAG_SUPERVISOR_MODE) {
-		if (!ecap_srs(iommu->ecap)) {
-			pr_err("No supervisor request support on %s\n",
-			       iommu->name);
-			return -EINVAL;
-		}
 		pasid_set_sre(pte);
-		if (pasid_enable_wpe(pte))
-			return -EINVAL;
-
+		pasid_set_wpe(pte);
 	}
 
-	if (flags & PASID_FLAG_FL5LP) {
-		if (cap_5lp_support(iommu->cap)) {
-			pasid_set_flpm(pte, 1);
-		} else {
-			pr_err("No 5-level paging support for first-level\n");
-			pasid_clear_entry(pte);
-			return -EINVAL;
-		}
-	}
+	if (flags & PASID_FLAG_FL5LP)
+		pasid_set_flpm(pte, 1);
 
 	if (flags & PASID_FLAG_PAGE_SNOOP)
 		pasid_set_pgsnp(pte);
@@ -620,6 +624,8 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 	/* Setup Present and PASID Granular Transfer Type: */
 	pasid_set_translation_type(pte, PASID_ENTRY_PGTT_FL_ONLY);
 	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
+
 	pasid_flush_caches(iommu, pte, pasid, did);
 
 	return 0;
@@ -677,16 +683,13 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pgd_val = virt_to_phys(pgd);
 	did = domain->iommu_did[iommu->seq_id];
 
-	pte = intel_pasid_get_entry(dev, pasid);
+	spin_lock(&iommu->lock);
+	pte = get_non_present_pasid_entry(dev, pasid);
 	if (!pte) {
-		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		spin_unlock(&iommu->lock);
 		return -ENODEV;
 	}
 
-	/* Caller must ensure PASID entry is not in use. */
-	if (pasid_pte_is_present(pte))
-		return -EBUSY;
-
 	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_slptr(pte, pgd_val);
@@ -702,6 +705,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	if (pasid != PASID_RID2PASID)
 		pasid_set_sre(pte);
 	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
+
 	pasid_flush_caches(iommu, pte, pasid, did);
 
 	return 0;
@@ -717,16 +722,13 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 	u16 did = FLPT_DEFAULT_DID;
 	struct pasid_entry *pte;
 
-	pte = intel_pasid_get_entry(dev, pasid);
+	spin_lock(&iommu->lock);
+	pte = get_non_present_pasid_entry(dev, pasid);
 	if (!pte) {
-		dev_err(dev, "Failed to get pasid entry of PASID %d\n", pasid);
+		spin_unlock(&iommu->lock);
 		return -ENODEV;
 	}
 
-	/* Caller must ensure PASID entry is not in use. */
-	if (pasid_pte_is_present(pte))
-		return -EBUSY;
-
 	pasid_clear_entry(pte);
 	pasid_set_domain_id(pte, did);
 	pasid_set_address_width(pte, iommu->agaw);
@@ -740,6 +742,8 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 	 */
 	pasid_set_sre(pte);
 	pasid_set_present(pte);
+	spin_unlock(&iommu->lock);
+
 	pasid_flush_caches(iommu, pte, pasid, did);
 
 	return 0;
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 580713aa9e07..64072e628bbd 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -328,9 +328,9 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
 					   unsigned int flags)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
-	unsigned long iflags, sflags;
 	struct intel_svm_dev *sdev;
 	struct intel_svm *svm;
+	unsigned long sflags;
 	int ret = 0;
 
 	svm = pasid_private_find(mm->pasid);
@@ -394,11 +394,8 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu,
 	sflags = (flags & SVM_FLAG_SUPERVISOR_MODE) ?
 			PASID_FLAG_SUPERVISOR_MODE : 0;
 	sflags |= cpu_feature_enabled(X86_FEATURE_LA57) ? PASID_FLAG_FL5LP : 0;
-	spin_lock_irqsave(&iommu->lock, iflags);
 	ret = intel_pasid_setup_first_level(iommu, dev, mm->pgd, mm->pasid,
 					    FLPT_DEFAULT_DID, sflags);
-	spin_unlock_irqrestore(&iommu->lock, iflags);
-
 	if (ret)
 		goto free_sdev;
 
-- 
2.25.1



* [PATCH 08/12] iommu/vt-d: Replace spin_lock_irqsave() with spin_lock()
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (6 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27  6:30 ` [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path Lu Baolu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The iommu->lock is used to protect changes in the root/context/pasid
tables and domain ID allocation. There's no use case that changes these
resources from interrupt context, hence there's no need to disable
interrupts while holding the spinlock.
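
As a general illustration of the rule being applied here (a sketch,
not driver code; demo_* names are made up):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

static void demo_update(void)
{
	unsigned long flags;

	/* Plain form: sufficient when the lock is never taken from
	 * interrupt context. */
	spin_lock(&demo_lock);
	/* ...modify the shared resource... */
	spin_unlock(&demo_lock);

	/* irqsave form: needed only when an interrupt handler on the
	 * same CPU could also contend for this lock. */
	spin_lock_irqsave(&demo_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&demo_lock, flags);
}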

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/debugfs.c |  6 ++----
 drivers/iommu/intel/iommu.c   | 17 +++++++----------
 2 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index eea8727aa7bc..ca1adceeba19 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -263,10 +263,9 @@ static void ctx_tbl_walk(struct seq_file *m, struct intel_iommu *iommu, u16 bus)
 
 static void root_tbl_walk(struct seq_file *m, struct intel_iommu *iommu)
 {
-	unsigned long flags;
 	u16 bus;
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	spin_lock(&iommu->lock);
 	seq_printf(m, "IOMMU %s: Root Table Address: 0x%llx\n", iommu->name,
 		   (u64)virt_to_phys(iommu->root_entry));
 	seq_puts(m, "B.D.F\tRoot_entry\t\t\t\tContext_entry\t\t\t\tPASID\tPASID_table_entry\n");
@@ -278,8 +277,7 @@ static void root_tbl_walk(struct seq_file *m, struct intel_iommu *iommu)
 	 */
 	for (bus = 0; bus < 256; bus++)
 		ctx_tbl_walk(m, iommu, bus);
-
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	spin_unlock(&iommu->lock);
 }
 
 static int dmar_translation_struct_show(struct seq_file *m, void *unused)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ccf3c7fa26f1..2e195a639502 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -797,13 +797,12 @@ static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
 {
 	struct context_entry *context;
 	int ret = 0;
-	unsigned long flags;
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	spin_lock(&iommu->lock);
 	context = iommu_context_addr(iommu, bus, devfn, 0);
 	if (context)
 		ret = context_present(context);
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	spin_unlock(&iommu->lock);
 	return ret;
 }
 
@@ -2287,16 +2286,15 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 {
 	struct intel_iommu *iommu = info->iommu;
 	struct context_entry *context;
-	unsigned long flags;
 	u16 did_old;
 
 	if (!iommu)
 		return;
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	spin_lock(&iommu->lock);
 	context = iommu_context_addr(iommu, bus, devfn, 0);
 	if (!context) {
-		spin_unlock_irqrestore(&iommu->lock, flags);
+		spin_unlock(&iommu->lock);
 		return;
 	}
 
@@ -2311,7 +2309,7 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 
 	context_clear_entry(context);
 	__iommu_flush_cache(iommu, context, sizeof(*context));
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	spin_unlock(&iommu->lock);
 	iommu->flush.flush_context(iommu,
 				   did_old,
 				   (((u16)bus) << 8) | devfn,
@@ -2764,7 +2762,6 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 	struct root_entry *old_rt;
 	phys_addr_t old_rt_phys;
 	int ctxt_table_entries;
-	unsigned long flags;
 	u64 rtaddr_reg;
 	int bus, ret;
 	bool new_ext, ext;
@@ -2807,7 +2804,7 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 		}
 	}
 
-	spin_lock_irqsave(&iommu->lock, flags);
+	spin_lock(&iommu->lock);
 
 	/* Context tables are copied, now write them to the root_entry table */
 	for (bus = 0; bus < 256; bus++) {
@@ -2826,7 +2823,7 @@ static int copy_translation_tables(struct intel_iommu *iommu)
 		iommu->root_entry[bus].hi = val;
 	}
 
-	spin_unlock_irqrestore(&iommu->lock, flags);
+	spin_unlock(&iommu->lock);
 
 	kfree(ctxt_tbls);
 
-- 
2.25.1



* [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (7 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 08/12] iommu/vt-d: Replace spin_lock_irqsave() with spin_lock() Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27 15:05   ` Jason Gunthorpe
  2022-06-01  9:28   ` Tian, Kevin
  2022-05-27  6:30 ` [PATCH 10/12] iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller Lu Baolu
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

When an IOMMU domain is about to be freed, it should not be set on any
device. Instead of silently papering over such bug cases, it's better
to trigger a warning so that any potential bug is reported and can be
fixed right away.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2e195a639502..6f3119c68cd2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -294,7 +294,6 @@ static LIST_HEAD(dmar_satc_units);
 /* bitmap for indexing intel_iommus */
 static int g_num_of_iommus;
 
-static void domain_remove_dev_info(struct dmar_domain *domain);
 static void dmar_remove_one_dev_info(struct device *dev);
 static void __dmar_remove_one_dev_info(struct device_domain_info *info);
 
@@ -1835,9 +1834,8 @@ static inline int guestwidth_to_adjustwidth(int gaw)
 
 static void domain_exit(struct dmar_domain *domain)
 {
-
-	/* Remove associated devices and clear attached or cached domains */
-	domain_remove_dev_info(domain);
+	if (WARN_ON(!list_empty(&domain->devices)))
+		return;
 
 	if (domain->pgd) {
 		LIST_HEAD(freelist);
@@ -2328,17 +2326,6 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 	__iommu_flush_dev_iotlb(info, 0, MAX_AGAW_PFN_WIDTH);
 }
 
-static void domain_remove_dev_info(struct dmar_domain *domain)
-{
-	struct device_domain_info *info, *tmp;
-	unsigned long flags;
-
-	spin_lock_irqsave(&device_domain_lock, flags);
-	list_for_each_entry_safe(info, tmp, &domain->devices, link)
-		__dmar_remove_one_dev_info(info);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-}
-
 static int domain_setup_first_level(struct intel_iommu *iommu,
 				    struct dmar_domain *domain,
 				    struct device *dev,
-- 
2.25.1



* [PATCH 10/12] iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (8 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27  6:30 ` [PATCH 11/12] iommu/vt-d: Use device_domain_lock accurately Lu Baolu
  2022-05-27  6:30 ` [PATCH 12/12] iommu/vt-d: Convert device_domain_lock into per-domain mutex Lu Baolu
  11 siblings, 0 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

Fold __dmar_remove_one_dev_info() into dmar_remove_one_dev_info(),
which is its only caller. Make the spinlock critical section cover only
the device list changes, and remove some unnecessary checks.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 34 +++++++++-------------------------
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 6f3119c68cd2..d02ddd338afd 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -295,7 +295,6 @@ static LIST_HEAD(dmar_satc_units);
 static int g_num_of_iommus;
 
 static void dmar_remove_one_dev_info(struct device *dev);
-static void __dmar_remove_one_dev_info(struct device_domain_info *info);
 
 int dmar_disabled = !IS_ENABLED(CONFIG_INTEL_IOMMU_DEFAULT_ON);
 int intel_iommu_sm = IS_ENABLED(CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON);
@@ -4133,20 +4132,14 @@ static void domain_context_clear(struct device_domain_info *info)
 			       &domain_context_clear_one_cb, info);
 }
 
-static void __dmar_remove_one_dev_info(struct device_domain_info *info)
+static void dmar_remove_one_dev_info(struct device *dev)
 {
-	struct dmar_domain *domain;
-	struct intel_iommu *iommu;
-
-	assert_spin_locked(&device_domain_lock);
-
-	if (WARN_ON(!info))
-		return;
-
-	iommu = info->iommu;
-	domain = info->domain;
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct dmar_domain *domain = info->domain;
+	struct intel_iommu *iommu = info->iommu;
+	unsigned long flags;
 
-	if (info->dev && !dev_is_real_dma_subdevice(info->dev)) {
+	if (!dev_is_real_dma_subdevice(info->dev)) {
 		if (dev_is_pci(info->dev) && sm_supported(iommu))
 			intel_pasid_tear_down_entry(iommu, info->dev,
 					PASID_RID2PASID, false);
@@ -4156,20 +4149,11 @@ static void __dmar_remove_one_dev_info(struct device_domain_info *info)
 		intel_pasid_free_table(info->dev);
 	}
 
-	list_del(&info->link);
-	domain_detach_iommu(domain, iommu);
-}
-
-static void dmar_remove_one_dev_info(struct device *dev)
-{
-	struct device_domain_info *info;
-	unsigned long flags;
-
 	spin_lock_irqsave(&device_domain_lock, flags);
-	info = dev_iommu_priv_get(dev);
-	if (info)
-		__dmar_remove_one_dev_info(info);
+	list_del(&info->link);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
+
+	domain_detach_iommu(domain, iommu);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width)
-- 
2.25.1



* [PATCH 11/12] iommu/vt-d: Use device_domain_lock accurately
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (9 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 10/12] iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  2022-05-27  6:30 ` [PATCH 12/12] iommu/vt-d: Convert device_domain_lock into per-domain mutex Lu Baolu
  11 siblings, 0 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

The device_domain_lock is used to protect the device tracking list of
a domain. Remove the unnecessary spin_lock/unlock() calls and move the
necessary ones so that they cover only the list accesses.
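
The transformation applied throughout this patch is sketched below
with made-up demo_* names: functions stop asserting that the caller
holds the lock and instead take it around the list access themselves.

#include <linux/list.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);	/* stands in for device_domain_lock */

/* Before: the walker relies on the caller holding the lock. */
static int demo_count_unlocked(struct list_head *devices)
{
	struct list_head *pos;
	int n = 0;

	assert_spin_locked(&demo_lock);
	list_for_each(pos, devices)
		n++;
	return n;
}

/* After: the walker takes the lock only around the list walk. */
static int demo_count(struct list_head *devices)
{
	struct list_head *pos;
	int n = 0;

	spin_lock(&demo_lock);
	list_for_each(pos, devices)
		n++;
	spin_unlock(&demo_lock);
	return n;
}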

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 68 +++++++++++++++----------------------
 1 file changed, 27 insertions(+), 41 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d02ddd338afd..f8aa8649dc6f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -534,16 +534,10 @@ static int domain_update_device_node(struct dmar_domain *domain)
 {
 	struct device_domain_info *info;
 	int nid = NUMA_NO_NODE;
+	unsigned long flags;
 
-	assert_spin_locked(&device_domain_lock);
-
-	if (list_empty(&domain->devices))
-		return NUMA_NO_NODE;
-
+	spin_lock_irqsave(&device_domain_lock, flags);
 	list_for_each_entry(info, &domain->devices, link) {
-		if (!info->dev)
-			continue;
-
 		/*
 		 * There could possibly be multiple device numa nodes as devices
 		 * within the same domain may sit behind different IOMMUs. There
@@ -554,6 +548,7 @@ static int domain_update_device_node(struct dmar_domain *domain)
 		if (nid != NUMA_NO_NODE)
 			break;
 	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	return nid;
 }
@@ -1376,49 +1371,50 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 }
 
 static struct device_domain_info *
-iommu_support_dev_iotlb (struct dmar_domain *domain, struct intel_iommu *iommu,
-			 u8 bus, u8 devfn)
+iommu_support_dev_iotlb(struct dmar_domain *domain, struct intel_iommu *iommu,
+			u8 bus, u8 devfn)
 {
-	struct device_domain_info *info;
-
-	assert_spin_locked(&device_domain_lock);
+	struct device_domain_info *info = NULL, *tmp;
+	unsigned long flags;
 
 	if (!iommu->qi)
 		return NULL;
 
-	list_for_each_entry(info, &domain->devices, link)
-		if (info->iommu == iommu && info->bus == bus &&
-		    info->devfn == devfn) {
-			if (info->ats_supported && info->dev)
-				return info;
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(tmp, &domain->devices, link) {
+		if (tmp->iommu == iommu && tmp->bus == bus &&
+		    tmp->devfn == devfn) {
+			if (tmp->ats_supported)
+				info = tmp;
 			break;
 		}
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
 
-	return NULL;
+	return info;
 }
 
 static void domain_update_iotlb(struct dmar_domain *domain)
 {
 	struct device_domain_info *info;
 	bool has_iotlb_device = false;
+	unsigned long flags;
 
-	assert_spin_locked(&device_domain_lock);
-
-	list_for_each_entry(info, &domain->devices, link)
+	spin_lock_irqsave(&device_domain_lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
 		if (info->ats_enabled) {
 			has_iotlb_device = true;
 			break;
 		}
-
+	}
 	domain->has_iotlb_device = has_iotlb_device;
+	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
 
 static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 {
 	struct pci_dev *pdev;
 
-	assert_spin_locked(&device_domain_lock);
-
 	if (!info || !dev_is_pci(info->dev))
 		return;
 
@@ -1464,8 +1460,6 @@ static void iommu_disable_dev_iotlb(struct device_domain_info *info)
 {
 	struct pci_dev *pdev;
 
-	assert_spin_locked(&device_domain_lock);
-
 	if (!dev_is_pci(info->dev))
 		return;
 
@@ -1900,11 +1894,11 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 				      struct pasid_table *table,
 				      u8 bus, u8 devfn)
 {
+	struct device_domain_info *info =
+			iommu_support_dev_iotlb(domain, iommu, bus, devfn);
 	u16 did = domain->iommu_did[iommu->seq_id];
 	int translation = CONTEXT_TT_MULTI_LEVEL;
-	struct device_domain_info *info = NULL;
 	struct context_entry *context;
-	unsigned long flags;
 	int ret;
 
 	WARN_ON(did == 0);
@@ -1917,7 +1911,6 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 
 	BUG_ON(!domain->pgd);
 
-	spin_lock_irqsave(&device_domain_lock, flags);
 	spin_lock(&iommu->lock);
 
 	ret = -ENOMEM;
@@ -1970,7 +1963,6 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 		 * Setup the Device-TLB enable bit and Page request
 		 * Enable bit:
 		 */
-		info = iommu_support_dev_iotlb(domain, iommu, bus, devfn);
 		if (info && info->ats_supported)
 			context_set_sm_dte(context);
 		if (info && info->pri_supported)
@@ -1993,7 +1985,6 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 					goto out_unlock;
 			}
 
-			info = iommu_support_dev_iotlb(domain, iommu, bus, devfn);
 			if (info && info->ats_supported)
 				translation = CONTEXT_TT_DEV_IOTLB;
 			else
@@ -2039,7 +2030,6 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 
 out_unlock:
 	spin_unlock(&iommu->lock);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	return ret;
 }
@@ -2452,15 +2442,14 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 	if (!iommu)
 		return -ENODEV;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
-	info->domain = domain;
 	ret = domain_attach_iommu(domain, iommu);
-	if (ret) {
-		spin_unlock_irqrestore(&device_domain_lock, flags);
+	if (ret)
 		return ret;
-	}
+
+	spin_lock_irqsave(&device_domain_lock, flags);
 	list_add(&info->link, &domain->devices);
 	spin_unlock_irqrestore(&device_domain_lock, flags);
+	info->domain = domain;
 
 	/* PASID table is mandatory for a PCI device in scalable mode. */
 	if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev)) {
@@ -4629,7 +4618,6 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct context_entry *context;
 	struct dmar_domain *domain;
-	unsigned long flags;
 	u64 ctx_lo;
 	int ret;
 
@@ -4637,7 +4625,6 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 	if (!domain)
 		return -EINVAL;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
 	spin_lock(&iommu->lock);
 
 	ret = -EINVAL;
@@ -4669,7 +4656,6 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 
  out:
 	spin_unlock(&iommu->lock);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	return ret;
 }
-- 
2.25.1



* [PATCH 12/12] iommu/vt-d: Convert device_domain_lock into per-domain mutex
  2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
                   ` (10 preceding siblings ...)
  2022-05-27  6:30 ` [PATCH 11/12] iommu/vt-d: Use device_domain_lock accurately Lu Baolu
@ 2022-05-27  6:30 ` Lu Baolu
  11 siblings, 0 replies; 55+ messages in thread
From: Lu Baolu @ 2022-05-27  6:30 UTC (permalink / raw)
  To: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel, Lu Baolu

Using a global device_domain_lock spinlock to protect the per-domain
device tracking lists is inefficient, especially considering that this
lock is also needed in hot paths.

On the other hand, in the iommu_unmap() path, the driver needs to
iterate over the device tracking list and flush the caches on the
devices through qi_submit_sync(), which unfortunately spins with
cpu_relax(). In order to avoid holding a spinlock while cpu_relax() is
called, this also converts the spinlock into a mutex. This works
because the device tracking lists are never touched in interrupt
context.
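
The conversion boils down to the sketch below; all demo_* names are
made up, and demo_flush_one() stands in for a callee that may wait,
like qi_submit_sync() spinning with cpu_relax().

#include <linux/list.h>
#include <linux/mutex.h>

struct demo_dev_info {
	struct list_head link;
};

struct demo_domain {
	struct mutex mutex;		/* protects @devices; may sleep */
	struct list_head devices;
};

static void demo_flush_one(struct demo_dev_info *info)
{
	/* ...may wait for hardware; we don't want to hold a spinlock
	 * across this... */
}

static void demo_domain_init(struct demo_domain *domain)
{
	INIT_LIST_HEAD(&domain->devices);
	mutex_init(&domain->mutex);
}

static void demo_flush_all(struct demo_domain *domain)
{
	struct demo_dev_info *info;

	mutex_lock(&domain->mutex);
	list_for_each_entry(info, &domain->devices, link)
		demo_flush_one(info);	/* waiting here is now fine */
	mutex_unlock(&domain->mutex);
}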

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.h |  1 +
 drivers/iommu/intel/iommu.c | 45 +++++++++++++++----------------------
 2 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 6724703d573b..9e572ddffc08 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -541,6 +541,7 @@ struct dmar_domain {
 	u8 force_snooping : 1;		/* Create IOPTEs with snoop control */
 	u8 set_pte_snp:1;
 
+	struct mutex mutex;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
 
 	struct dma_pte	*pgd;		/* virtual address */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f8aa8649dc6f..1815a9d73426 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -310,7 +310,6 @@ static int iommu_skip_te_disable;
 #define IDENTMAP_GFX		2
 #define IDENTMAP_AZALIA		4
 
-static DEFINE_SPINLOCK(device_domain_lock);
 const struct iommu_ops intel_iommu_ops;
 
 static bool translation_pre_enabled(struct intel_iommu *iommu)
@@ -534,9 +533,8 @@ static int domain_update_device_node(struct dmar_domain *domain)
 {
 	struct device_domain_info *info;
 	int nid = NUMA_NO_NODE;
-	unsigned long flags;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_for_each_entry(info, &domain->devices, link) {
 		/*
 		 * There could possibly be multiple device numa nodes as devices
@@ -548,7 +546,7 @@ static int domain_update_device_node(struct dmar_domain *domain)
 		if (nid != NUMA_NO_NODE)
 			break;
 	}
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 
 	return nid;
 }
@@ -1375,12 +1373,11 @@ iommu_support_dev_iotlb(struct dmar_domain *domain, struct intel_iommu *iommu,
 			u8 bus, u8 devfn)
 {
 	struct device_domain_info *info = NULL, *tmp;
-	unsigned long flags;
 
 	if (!iommu->qi)
 		return NULL;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_for_each_entry(tmp, &domain->devices, link) {
 		if (tmp->iommu == iommu && tmp->bus == bus &&
 		    tmp->devfn == devfn) {
@@ -1389,7 +1386,7 @@ iommu_support_dev_iotlb(struct dmar_domain *domain, struct intel_iommu *iommu,
 			break;
 		}
 	}
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 
 	return info;
 }
@@ -1398,9 +1395,8 @@ static void domain_update_iotlb(struct dmar_domain *domain)
 {
 	struct device_domain_info *info;
 	bool has_iotlb_device = false;
-	unsigned long flags;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_for_each_entry(info, &domain->devices, link) {
 		if (info->ats_enabled) {
 			has_iotlb_device = true;
@@ -1408,7 +1404,7 @@ static void domain_update_iotlb(struct dmar_domain *domain)
 		}
 	}
 	domain->has_iotlb_device = has_iotlb_device;
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 }
 
 static void iommu_enable_dev_iotlb(struct device_domain_info *info)
@@ -1499,17 +1495,15 @@ static void __iommu_flush_dev_iotlb(struct device_domain_info *info,
 static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 				  u64 addr, unsigned mask)
 {
-	unsigned long flags;
 	struct device_domain_info *info;
 
 	if (!domain->has_iotlb_device)
 		return;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_for_each_entry(info, &domain->devices, link)
 		__iommu_flush_dev_iotlb(info, addr, mask);
-
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 }
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
@@ -1761,6 +1755,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
 		domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
 	domain->has_iotlb_device = false;
 	INIT_LIST_HEAD(&domain->devices);
+	mutex_init(&domain->mutex);
 
 	return domain;
 }
@@ -2434,7 +2429,6 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct intel_iommu *iommu;
-	unsigned long flags;
 	u8 bus, devfn;
 	int ret;
 
@@ -2446,9 +2440,9 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
 	if (ret)
 		return ret;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_add(&info->link, &domain->devices);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 	info->domain = domain;
 
 	/* PASID table is mandatory for a PCI device in scalable mode. */
@@ -4126,7 +4120,6 @@ static void dmar_remove_one_dev_info(struct device *dev)
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct dmar_domain *domain = info->domain;
 	struct intel_iommu *iommu = info->iommu;
-	unsigned long flags;
 
 	if (!dev_is_real_dma_subdevice(info->dev)) {
 		if (dev_is_pci(info->dev) && sm_supported(iommu))
@@ -4138,9 +4131,9 @@ static void dmar_remove_one_dev_info(struct device *dev)
 		intel_pasid_free_table(info->dev);
 	}
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&domain->mutex);
 	list_del(&info->link);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&domain->mutex);
 
 	domain_detach_iommu(domain, iommu);
 }
@@ -4424,7 +4417,7 @@ static bool domain_support_force_snooping(struct dmar_domain *domain)
 	struct device_domain_info *info;
 	bool support = true;
 
-	assert_spin_locked(&device_domain_lock);
+	lockdep_assert_held(&domain->mutex);
 	list_for_each_entry(info, &domain->devices, link) {
 		if (!ecap_sc_support(info->iommu->ecap)) {
 			support = false;
@@ -4439,8 +4432,7 @@ static void domain_set_force_snooping(struct dmar_domain *domain)
 {
 	struct device_domain_info *info;
 
-	assert_spin_locked(&device_domain_lock);
-
+	lockdep_assert_held(&domain->mutex);
 	/*
 	 * Second level page table supports per-PTE snoop control. The
 	 * iommu_map() interface will handle this by setting SNP bit.
@@ -4458,20 +4450,19 @@ static void domain_set_force_snooping(struct dmar_domain *domain)
 static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-	unsigned long flags;
 
 	if (dmar_domain->force_snooping)
 		return true;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
+	mutex_lock(&dmar_domain->mutex);
 	if (!domain_support_force_snooping(dmar_domain)) {
-		spin_unlock_irqrestore(&device_domain_lock, flags);
+		mutex_unlock(&dmar_domain->mutex);
 		return false;
 	}
 
 	domain_set_force_snooping(dmar_domain);
 	dmar_domain->force_snooping = true;
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	mutex_unlock(&dmar_domain->mutex);
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-27  6:30 ` [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs Lu Baolu
@ 2022-05-27 14:59   ` Jason Gunthorpe
  2022-05-29  5:14     ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 14:59 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Fri, May 27, 2022 at 02:30:08PM +0800, Lu Baolu wrote:
> Retrieve the attached domain for a device through the generic interface
> exposed by the iommu core. This also makes device_domain_lock static.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>  drivers/iommu/intel/iommu.h   |  1 -
>  drivers/iommu/intel/debugfs.c | 20 ++++++++------------
>  drivers/iommu/intel/iommu.c   |  2 +-
>  3 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index a22adfbdf870..8a6d64d726c0 100644
> +++ b/drivers/iommu/intel/iommu.h
> @@ -480,7 +480,6 @@ enum {
>  #define VTD_FLAG_SVM_CAPABLE		(1 << 2)
>  
>  extern int intel_iommu_sm;
> -extern spinlock_t device_domain_lock;
>  
>  #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>  #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
> diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
> index d927ef10641b..eea8727aa7bc 100644
> +++ b/drivers/iommu/intel/debugfs.c
> @@ -344,19 +344,21 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
>  
>  static int show_device_domain_translation(struct device *dev, void *data)
>  {
> -	struct device_domain_info *info = dev_iommu_priv_get(dev);
> -	struct dmar_domain *domain = info->domain;
> +	struct dmar_domain *dmar_domain;
> +	struct iommu_domain *domain;
>  	struct seq_file *m = data;
>  	u64 path[6] = { 0 };
>  
> +	domain = iommu_get_domain_for_dev(dev);
>  	if (!domain)
>  		return 0;

The iommu_get_domain_for_dev() API should be called something like
'iommu_get_dma_api_domain()' and clearly documented that it is safe to
call only so long as a DMA API using driver is attached to the device,
which is most of the current callers.

This use in random sysfs inside the iommu driver is not OK because it
doesn't have any locking protecting domain from concurrent free.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain()
  2022-05-27  6:30 ` [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain() Lu Baolu
@ 2022-05-27 15:00   ` Jason Gunthorpe
  2022-06-01  8:53   ` Tian, Kevin
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 15:00 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Fri, May 27, 2022 at 02:30:09PM +0800, Lu Baolu wrote:
> The per-device device_domain_info data could be retrieved from the
> device itself. There's no need to search a global list.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel/iommu.h |  2 --
>  drivers/iommu/intel/iommu.c | 25 -------------------------
>  drivers/iommu/intel/pasid.c | 37 +++++++++++--------------------------
>  3 files changed, 11 insertions(+), 53 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()
  2022-05-27  6:30 ` [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu() Lu Baolu
@ 2022-05-27 15:01   ` Jason Gunthorpe
  2022-05-29  5:22     ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 15:01 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Fri, May 27, 2022 at 02:30:10PM +0800, Lu Baolu wrote:
> The disable_dmar_iommu() is called when IOMMU initialization fails or
> the IOMMU is hot-removed from the system. In both cases, there is no
> need to clear the IOMMU translation data structures for devices.
> 
> On the initialization path, the device probing only happens after the
> IOMMU is initialized successfully, hence there're no translation data
> structures.
> 
> On the hot-remove path, there is no real use case where the IOMMU is
> hot-removed, but the devices that it manages are still alive in the
> system. The translation data structures were torn down during device
> release, hence there's no need to repeat it in IOMMU hot-remove path
> either.

Can you leave behind a one-statement WARN_ON of some kind to check this?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()
  2022-05-27  6:30 ` [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk() Lu Baolu
@ 2022-05-27 15:01   ` Jason Gunthorpe
  2022-06-01  8:56   ` Tian, Kevin
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 15:01 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Fri, May 27, 2022 at 02:30:11PM +0800, Lu Baolu wrote:
> Use pci_get_domain_bus_and_slot() instead of searching the global list
> to retrieve the pci device pointer. This removes device_domain_list
> global list as there are no consumers anymore.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel/iommu.h |  1 -
>  drivers/iommu/intel/iommu.c | 33 ++++++---------------------------
>  2 files changed, 6 insertions(+), 28 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-05-27  6:30 ` [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path Lu Baolu
@ 2022-05-27 15:05   ` Jason Gunthorpe
  2022-06-01  9:28   ` Tian, Kevin
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 15:05 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Fri, May 27, 2022 at 02:30:16PM +0800, Lu Baolu wrote:
> When the IOMMU domain is about to be freed, it should not be set on any
> device. Instead of silently dealing with some bug cases, it's better to
> trigger a warning to report and fix any potential bugs at the first time.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel/iommu.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-27 14:59   ` Jason Gunthorpe
@ 2022-05-29  5:14     ` Baolu Lu
  2022-05-30 12:14       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Baolu Lu @ 2022-05-29  5:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On 2022/5/27 22:59, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 02:30:08PM +0800, Lu Baolu wrote:
>> Retrieve the attached domain for a device through the generic interface
>> exposed by the iommu core. This also makes device_domain_lock static.
>>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>>   drivers/iommu/intel/iommu.h   |  1 -
>>   drivers/iommu/intel/debugfs.c | 20 ++++++++------------
>>   drivers/iommu/intel/iommu.c   |  2 +-
>>   3 files changed, 9 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
>> index a22adfbdf870..8a6d64d726c0 100644
>> +++ b/drivers/iommu/intel/iommu.h
>> @@ -480,7 +480,6 @@ enum {
>>   #define VTD_FLAG_SVM_CAPABLE		(1 << 2)
>>   
>>   extern int intel_iommu_sm;
>> -extern spinlock_t device_domain_lock;
>>   
>>   #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
>>   #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
>> diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
>> index d927ef10641b..eea8727aa7bc 100644
>> +++ b/drivers/iommu/intel/debugfs.c
>> @@ -344,19 +344,21 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
>>   
>>   static int show_device_domain_translation(struct device *dev, void *data)
>>   {
>> -	struct device_domain_info *info = dev_iommu_priv_get(dev);
>> -	struct dmar_domain *domain = info->domain;
>> +	struct dmar_domain *dmar_domain;
>> +	struct iommu_domain *domain;
>>   	struct seq_file *m = data;
>>   	u64 path[6] = { 0 };
>>   
>> +	domain = iommu_get_domain_for_dev(dev);
>>   	if (!domain)
>>   		return 0;
> 
> The iommu_get_domain_for_dev() API should be called something like
> 'iommu_get_dma_api_domain()' and clearly documented that it is safe to
> call only so long as a DMA API using driver is attached to the device,
> which is most of the current callers.

Yes, agreed.

> This use in random sysfs inside the iommu driver is not OK because it
> doesn't have any locking protecting domain from concurrent free.

This is not sysfs, but debugfs. The description of this patch is
confusing. I should make it specific and straightforward.

How about below one?

 From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@linux.intel.com>
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage

The domain_translation_struct debugfs node is used to dump static
mappings of PCI devices. It potentially races with setting new
domains to devices and the iommu_map/unmap() interfaces. The existing
code tries to use the global spinlock device_domain_lock to avoid the
races, but this is problematical as this lock is only used to protect
the device tracking lists of the domains.

Instead of using an immature lock to cover up the problem, it's better
to explicitly restrict the use of this debugfs node. This also makes
device_domain_lock static.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
  drivers/iommu/intel/debugfs.c | 17 ++++++++---------
  drivers/iommu/intel/iommu.c   |  2 +-
  drivers/iommu/intel/iommu.h   |  1 -
  3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..9642e3e9d6b0 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -362,17 +362,16 @@ static int show_device_domain_translation(struct device *dev, void *data)
  	return 0;
  }

+/*
+ * Dump the static mappings of PCI devices. This is only for DEBUGFS code,
+ * don't use it for other purposes.  It potentially races with setting new
+ * domains to devices and iommu_map/unmap(). Use the trace events under
+ * /sys/kernel/debug/tracing/events/iommu/ for dynamic debugging.
+ */
  static int domain_translation_struct_show(struct seq_file *m, void *unused)
  {
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&device_domain_lock, flags);
-	ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-			       show_device_domain_translation);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
-
-	return ret;
+	return bus_for_each_dev(&pci_bus_type, NULL, m,
+				show_device_domain_translation);
  }
  DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1af4b6562266..cacae8bdaa65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
  #define IDENTMAP_GFX		2
  #define IDENTMAP_AZALIA		4

-DEFINE_SPINLOCK(device_domain_lock);
+static DEFINE_SPINLOCK(device_domain_lock);
  static LIST_HEAD(device_domain_list);

  /*
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a22adfbdf870..8a6d64d726c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -480,7 +480,6 @@ enum {
  #define VTD_FLAG_SVM_CAPABLE		(1 << 2)

  extern int intel_iommu_sm;
-extern spinlock_t device_domain_lock;

  #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
  #define pasid_supported(iommu)	(sm_supported(iommu) &&			\
-- 
2.25.1

Best regards,
baolu

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu()
  2022-05-27 15:01   ` Jason Gunthorpe
@ 2022-05-29  5:22     ` Baolu Lu
  0 siblings, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-05-29  5:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On 2022/5/27 23:01, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 02:30:10PM +0800, Lu Baolu wrote:
>> The disable_dmar_iommu() is called when IOMMU initialization fails or
>> the IOMMU is hot-removed from the system. In both cases, there is no
>> need to clear the IOMMU translation data structures for devices.
>>
>> On the initialization path, the device probing only happens after the
>> IOMMU is initialized successfully, hence there're no translation data
>> structures.
>>
>> On the hot-remove path, there is no real use case where the IOMMU is
>> hot-removed, but the devices that it manages are still alive in the
>> system. The translation data structures were torn down during device
>> release, hence there's no need to repeat it in IOMMU hot-remove path
>> either.
> 
> Can you leave behind a one-statement WARN_ON of some kind to check this?

Sure. Since the default domain is the first domain allocated for a device
and the last one freed, we can WARN_ON the case where there are still
domain IDs in use. How do you like this?

+       /*
+        * All iommu domains must have been detached from the devices,
+        * hence there should be no domain IDs in use.
+        */
+       if (WARN_ON(bitmap_weight(iommu->domain_ids, cap_ndoms(iommu->cap))
+                   != NUM_RESERVED_DID))
+               return;

Best regards,
baolu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-29  5:14     ` Baolu Lu
@ 2022-05-30 12:14       ` Jason Gunthorpe
  2022-05-31  3:02         ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-30 12:14 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:

> From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Date: Sun, 29 May 2022 10:18:56 +0800
> Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage
> 
> The domain_translation_struct debugfs node is used to dump static
> mappings of PCI devices. It potentially races with setting new
> domains to devices and the iommu_map/unmap() interfaces. The existing
> code tries to use the global spinlock device_domain_lock to avoid the
> races, but this is problematical as this lock is only used to protect
> the device tracking lists of the domains.
> 
> Instead of using an immature lock to cover up the problem, it's better
> to explicitly restrict the use of this debugfs node. This also makes
> device_domain_lock static.

What does "explicitly restrict" mean?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-30 12:14       ` Jason Gunthorpe
@ 2022-05-31  3:02         ` Baolu Lu
  2022-05-31 13:10           ` Jason Gunthorpe
  2022-05-31 13:52           ` Robin Murphy
  0 siblings, 2 replies; 55+ messages in thread
From: Baolu Lu @ 2022-05-31  3:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On 2022/5/30 20:14, Jason Gunthorpe wrote:
> On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:
> 
>>  From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Date: Sun, 29 May 2022 10:18:56 +0800
>> Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage
>>
>> The domain_translation_struct debugfs node is used to dump static
>> mappings of PCI devices. It potentially races with setting new
>> domains to devices and the iommu_map/unmap() interfaces. The existing
>> code tries to use the global spinlock device_domain_lock to avoid the
>> races, but this is problematical as this lock is only used to protect
>> the device tracking lists of the domains.
>>
>> Instead of using an immature lock to cover up the problem, it's better
>> to explicitly restrict the use of this debugfs node. This also makes
>> device_domain_lock static.
> 
> What does "explicitly restrict" mean?

I originally thought about adding restrictions on this interface to a
document. But obviously, this is a naive idea. :-) I went over the code
again. The races exist in two paths:

1. Dump the page table in use while setting a new page table to the
    device.
2. A high-level page table entry has been marked as non-present, but the
    dumping code has walked down to the low-level tables.

For case 1, we can try to solve it by dumping tables while holding the
group->mutex.

For case 2, it is a bit weird. I tried to add a rwsem lock to make the
iommu_unmap() and dumping tables in debugfs exclusive. This does not
work because debugfs may depend on the DMA of the devices to work. It
seems that what we can do is to allow this race, but when we traverse
the page table in debugfs, we will check the validity of the physical
address retrieved from the page table entry. Then, the worst case is to
print some useless information.

The real code looks like this:

 From 3feb0727f9d7095729ef75ab1967270045b3a38c Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@linux.intel.com>
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage

The domain_translation_struct debugfs node is used to dump the DMAR page
tables for the PCI devices. It potentially races with setting domains to
devices and the iommu_unmap() interface. The existing code uses a global
spinlock device_domain_lock to avoid the races, but this is problematical
as this lock is only used to protect the device tracking lists of each
domain.

This replaces device_domain_lock with group->mutex to protect the traverse
of the page tables from setting a new domain and always check the physical
address retrieved from the page table entry before traversing to the next-
level page table.

As a cleanup, this also makes device_domain_lock static.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
  drivers/iommu/intel/debugfs.c | 42 ++++++++++++++++++++++-------------
  drivers/iommu/intel/iommu.c   |  2 +-
  drivers/iommu/intel/iommu.h   |  1 -
  3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..e6f4835b8d9f 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
  			continue;

  		path[level] = pde->val;
-		if (dma_pte_superpage(pde) || level == 1)
+		if (dma_pte_superpage(pde) || level == 1) {
  			dump_page_info(m, start, path);
-		else
-			pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+		} else {
+			unsigned long phys_addr;
+
+			phys_addr = (unsigned long)dma_pte_addr(pde);
+			if (!pfn_valid(__phys_to_pfn(phys_addr)))
+				break;
+			pgtable_walk_level(m, phys_to_virt(phys_addr),
  					   level - 1, start, path);
+		}
  		path[level] = 0;
  	}
  }

-static int show_device_domain_translation(struct device *dev, void *data)
+static int __show_device_domain_translation(struct device *dev, void *data)
  {
  	struct device_domain_info *info = dev_iommu_priv_get(dev);
  	struct dmar_domain *domain = info->domain;
  	struct seq_file *m = data;
  	u64 path[6] = { 0 };

-	if (!domain)
-		return 0;
-
  	seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
  		   (u64)virt_to_phys(domain->pgd));
  	seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
@@ -359,20 +362,27 @@ static int show_device_domain_translation(struct device *dev, void *data)
  	pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
  	seq_putc(m, '\n');

-	return 0;
+	return 1;
  }

-static int domain_translation_struct_show(struct seq_file *m, void *unused)
+static int show_device_domain_translation(struct device *dev, void *data)
  {
-	unsigned long flags;
-	int ret;
+	struct iommu_group *group;

-	spin_lock_irqsave(&device_domain_lock, flags);
-	ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-			       show_device_domain_translation);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
+	group = iommu_group_get(dev);
+	if (group) {
+		iommu_group_for_each_dev(group, data,
+					 __show_device_domain_translation);
+		iommu_group_put(group);
+	}

-	return ret;
+	return 0;
+}
+
+static int domain_translation_struct_show(struct seq_file *m, void *unused)
+{
+	return bus_for_each_dev(&pci_bus_type, NULL, m,
+				show_device_domain_translation);
  }
  DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1af4b6562266..cacae8bdaa65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
  #define IDENTMAP_GFX		2
  #define IDENTMAP_AZALIA		4

-DEFINE_SPINLOCK(device_domain_lock);
+static DEFINE_SPINLOCK(device_domain_lock);
  static LIST_HEAD(device_domain_list);

  /*
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a22adfbdf870..8a6d64d726c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -480,7 +480,6 @@ enum {
  #define VTD_FLAG_SVM_CAPABLE		(1 << 2)

  extern int intel_iommu_sm;
-extern spinlock_t device_domain_lock;

  #define sm_supported(iommu)	(intel_iommu_sm && ecap_smts((iommu)->ecap))
  #define pasid_supported(iommu)	(sm_supported(iommu) &&

Best regards,
baolu

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31  3:02         ` Baolu Lu
@ 2022-05-31 13:10           ` Jason Gunthorpe
  2022-05-31 14:11             ` Baolu Lu
  2022-05-31 13:52           ` Robin Murphy
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 13:10 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:

> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> iommu_unmap() and dumping tables in debugfs exclusive. This does not
> work because debugfs may depend on the DMA of the devices to work. It
> seems that what we can do is to allow this race, but when we traverse
> the page table in debugfs, we will check the validity of the physical
> address retrieved from the page table entry. Then, the worst case is to
> print some useless information.

Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?

It is just debugfs, so maybe it is not the end of the world, but
still..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31  3:02         ` Baolu Lu
  2022-05-31 13:10           ` Jason Gunthorpe
@ 2022-05-31 13:52           ` Robin Murphy
  2022-05-31 15:59             ` Jason Gunthorpe
  2022-06-01  5:33             ` Baolu Lu
  1 sibling, 2 replies; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 13:52 UTC (permalink / raw)
  To: Baolu Lu, Jason Gunthorpe
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022-05-31 04:02, Baolu Lu wrote:
> On 2022/5/30 20:14, Jason Gunthorpe wrote:
>> On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:
>>
>>>  From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>> Date: Sun, 29 May 2022 10:18:56 +0800
>>> Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage
>>>
>>> The domain_translation_struct debugfs node is used to dump static
>>> mappings of PCI devices. It potentially races with setting new
>>> domains to devices and the iommu_map/unmap() interfaces. The existing
>>> code tries to use the global spinlock device_domain_lock to avoid the
>>> races, but this is problematical as this lock is only used to protect
>>> the device tracking lists of the domains.
>>>
>>> Instead of using an immature lock to cover up the problem, it's better
>>> to explicitly restrict the use of this debugfs node. This also makes
>>> device_domain_lock static.
>>
>> What does "explicitly restrict" mean?
> 
> I originally thought about adding restrictions on this interface to a
> document. But obviously, this is a naive idea. :-) I went over the code
> again. The races exist in two paths:
> 
> 1. Dump the page table in use while setting a new page table to the
>     device.
> 2. A high-level page table entry has been marked as non-present, but the
>     dumping code has walked down to the low-level tables.
> 
> For case 1, we can try to solve it by dumping tables while holding the
> group->mutex.
> 
> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> iommu_unmap() and dumping tables in debugfs exclusive. This does not
> work because debugfs may depend on the DMA of the devices to work. It
> seems that what we can do is to allow this race, but when we traverse
> the page table in debugfs, we will check the validity of the physical
> address retrieved from the page table entry. Then, the worst case is to
> print some useless information.
> 
> The real code looks like this:
> 
>  From 3feb0727f9d7095729ef75ab1967270045b3a38c Mon Sep 17 00:00:00 2001
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Date: Sun, 29 May 2022 10:18:56 +0800
> Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage
> 
> The domain_translation_struct debugfs node is used to dump the DMAR page
> tables for the PCI devices. It potentially races with setting domains to
> devices and the iommu_unmap() interface. The existing code uses a global
> spinlock device_domain_lock to avoid the races, but this is problematical
> as this lock is only used to protect the device tracking lists of each
> domain.
> 
> This replaces device_domain_lock with group->mutex to protect the traverse
> of the page tables from setting a new domain and always check the physical
> address retrieved from the page table entry before traversing to the next-
> level page table.
> 
> As a cleanup, this also makes device_domain_lock static.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>   drivers/iommu/intel/debugfs.c | 42 ++++++++++++++++++++++-------------
>   drivers/iommu/intel/iommu.c   |  2 +-
>   drivers/iommu/intel/iommu.h   |  1 -
>   3 files changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
> index d927ef10641b..e6f4835b8d9f 100644
> --- a/drivers/iommu/intel/debugfs.c
> +++ b/drivers/iommu/intel/debugfs.c
> @@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
>               continue;
> 
>           path[level] = pde->val;
> -        if (dma_pte_superpage(pde) || level == 1)
> +        if (dma_pte_superpage(pde) || level == 1) {
>               dump_page_info(m, start, path);
> -        else
> -            pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
> +        } else {
> +            unsigned long phys_addr;
> +
> +            phys_addr = (unsigned long)dma_pte_addr(pde);
> +            if (!pfn_valid(__phys_to_pfn(phys_addr)))

Given that pte_present(pde) passed just above, it was almost certainly a 
valid entry, so it seems unlikely that the physical address it pointed 
to could have disappeared in the meantime. If you're worried about the 
potential case where we've been preempted during this walk for long 
enough that the page has already been freed by an unmap, reallocated, 
and filled with someone else's data that happens to look like valid 
PTEs, this still isn't enough, since that data could just as well happen 
to look like valid physical addresses too.

I imagine that if you want to safely walk pagetables concurrently with 
them potentially being freed, you'd probably need to get RCU involved.

> +                break;
> +            pgtable_walk_level(m, phys_to_virt(phys_addr),

Also, obligatory reminder that pfn_valid() only means that pfn_to_page() 
gets you a valid struct page. Whether that page is direct-mapped kernel 
memory or not is a different matter.

>                          level - 1, start, path);
> +        }
>           path[level] = 0;
>       }
>   }
> 
> -static int show_device_domain_translation(struct device *dev, void *data)
> +static int __show_device_domain_translation(struct device *dev, void *data)
>   {
>       struct device_domain_info *info = dev_iommu_priv_get(dev);
>       struct dmar_domain *domain = info->domain;
>       struct seq_file *m = data;
>       u64 path[6] = { 0 };
> 
> -    if (!domain)
> -        return 0;
> -
>       seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
>              (u64)virt_to_phys(domain->pgd));
>       seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
> @@ -359,20 +362,27 @@ static int show_device_domain_translation(struct device *dev, void *data)
>       pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
>       seq_putc(m, '\n');
> 
> -    return 0;
> +    return 1;
>   }
> 
> -static int domain_translation_struct_show(struct seq_file *m, void *unused)
> +static int show_device_domain_translation(struct device *dev, void *data)
>   {
> -    unsigned long flags;
> -    int ret;
> +    struct iommu_group *group;
> 
> -    spin_lock_irqsave(&device_domain_lock, flags);
> -    ret = bus_for_each_dev(&pci_bus_type, NULL, m,
> -                   show_device_domain_translation);
> -    spin_unlock_irqrestore(&device_domain_lock, flags);
> +    group = iommu_group_get(dev);
> +    if (group) {
> +        iommu_group_for_each_dev(group, data,
> +                     __show_device_domain_translation);

Why group_for_each_dev? If there *are* multiple devices in the group 
then by definition they should be attached to the same domain, so 
dumping that domain's mappings more than once seems pointless. 
Especially given that the outer bus_for_each_dev iteration will already 
visit each individual device anyway, so this would only make the 
redundancy even worse than it already is.

Robin.

> +        iommu_group_put(group);
> +    }
> 
> -    return ret;
> +    return 0;
> +}
> +
> +static int domain_translation_struct_show(struct seq_file *m, void *unused)
> +{
> +    return bus_for_each_dev(&pci_bus_type, NULL, m,
> +                show_device_domain_translation);
>   }
>   DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 1af4b6562266..cacae8bdaa65 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
>   #define IDENTMAP_GFX        2
>   #define IDENTMAP_AZALIA        4
> 
> -DEFINE_SPINLOCK(device_domain_lock);
> +static DEFINE_SPINLOCK(device_domain_lock);
>   static LIST_HEAD(device_domain_list);
> 
>   /*
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index a22adfbdf870..8a6d64d726c0 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -480,7 +480,6 @@ enum {
>   #define VTD_FLAG_SVM_CAPABLE        (1 << 2)
> 
>   extern int intel_iommu_sm;
> -extern spinlock_t device_domain_lock;
> 
>   #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
>   #define pasid_supported(iommu)    (sm_supported(iommu) &&
> 
> Best regards,
> baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 13:10           ` Jason Gunthorpe
@ 2022-05-31 14:11             ` Baolu Lu
  2022-05-31 14:53               ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Baolu Lu @ 2022-05-31 14:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On 2022/5/31 21:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> 
>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>> work because debugfs may depend on the DMA of the devices to work. It
>> seems that what we can do is to allow this race, but when we traverse
>> the page table in debugfs, we will check the validity of the physical
>> address retrieved from the page table entry. Then, the worst case is to
>> print some useless information.
> 
> Sounds horrible, don't you have locking around the IOPTEs of some
> kind? How does updating them work reliably?

There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in iommu_map/unmap() APIs. Therefore, the race
only exists when multiple ranges share some high-level IOPTEs. The IOMMU
driver updates those IOPTEs using the compare-and-exchange atomic
operation.
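
For the shared high-level IOPTEs, the installation path is roughly the
following (a simplified sketch modeled on pfn_to_dma_pte(); the helper
name install_level() is made up for illustration):

	/*
	 * Only one thread can win the cmpxchg64() on an empty slot; the
	 * loser frees its freshly allocated page and continues with the
	 * entry the winner installed.
	 */
	static struct dma_pte *install_level(struct dmar_domain *domain,
					     struct dma_pte *pte)
	{
		void *page = alloc_pgtable_page(domain->nid);
		u64 pteval;

		if (!page)
			return NULL;

		pteval = virt_to_phys(page) | DMA_PTE_READ | DMA_PTE_WRITE;
		if (cmpxchg64(&pte->val, 0ULL, pteval))
			free_pgtable_page(page);	/* lost the race */

		return phys_to_virt(dma_pte_addr(pte));
	}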

> 
> It is just debugfs, so maybe it is not the end of the world, but
> still..

Fair enough. I think this is somewhat similar to how IOMMU hardware can
traverse the page table at any time without regard to when the CPUs
update it. The IOMMU hardware generates faults when it encounters a
failure during the page table walk. Similarly, perhaps debugfs could
dump all-ones for an invalid IOPTE?

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 14:11             ` Baolu Lu
@ 2022-05-31 14:53               ` Jason Gunthorpe
  2022-05-31 15:01                 ` Robin Murphy
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 14:53 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Robin Murphy, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> > 
> > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > work because debugfs may depend on the DMA of the devices to work. It
> > > seems that what we can do is to allow this race, but when we traverse
> > > the page table in debugfs, we will check the validity of the physical
> > > address retrieved from the page table entry. Then, the worst case is to
> > > print some useless information.
> > 
> > Sounds horrible, don't you have locking around the IOPTEs of some
> > kind? How does updating them work reliably?
> 
> There's no locking around updating the IOPTEs. The basic assumption is
> that at any time, there's only a single thread manipulating the mappings
> of the range specified in iommu_map/unmap() APIs. Therefore, the race
> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> driver updates those IOPTEs using the compare-and-exchange atomic
> operation.

Oh? Did I miss where that was documented as part of the iommu API?

Daniel posted patches for VFIO to multi-thread iommu_domain mapping.

iommufd goes out of its way to avoid this kind of serialization so
that userspace can parallel map IOVA.

I think if this is the requirement then the iommu API needs to
provide a lock around the domain for the driver..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 14:53               ` Jason Gunthorpe
@ 2022-05-31 15:01                 ` Robin Murphy
  2022-05-31 15:13                   ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 15:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Baolu Lu
  Cc: Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022-05-31 15:53, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
>> On 2022/5/31 21:10, Jason Gunthorpe wrote:
>>> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>>>
>>>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>>>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>>>> work because debugfs may depend on the DMA of the devices to work. It
>>>> seems that what we can do is to allow this race, but when we traverse
>>>> the page table in debugfs, we will check the validity of the physical
>>>> address retrieved from the page table entry. Then, the worst case is to
>>>> print some useless information.
>>>
>>> Sounds horrible, don't you have locking around the IOPTEs of some
>>> kind? How does updating them work reliably?
>>
>> There's no locking around updating the IOPTEs. The basic assumption is
>> that at any time, there's only a single thread manipulating the mappings
>> of the range specified in iommu_map/unmap() APIs. Therefore, the race
>> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
>> driver updates those IOPTEs using the compare-and-exchange atomic
>> operation.
> 
> Oh? Did I miss where that was documented as part of the iommu API?
> 
> Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
> 
> iommufd goes out of its way to avoid this kind of serialization so
> that userspace can parallel map IOVA.
> 
> I think if this is the requirement then the iommu API needs to
> provide a lock around the domain for the driver..

Eww, no, we can't kill performance by forcing serialisation on the 
entire API just for one silly driver-internal debugfs corner :(

Robin.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 15:01                 ` Robin Murphy
@ 2022-05-31 15:13                   ` Jason Gunthorpe
  2022-05-31 16:01                     ` Robin Murphy
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 15:13 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:
> On 2022-05-31 15:53, Jason Gunthorpe wrote:
> > On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
> > > On 2022/5/31 21:10, Jason Gunthorpe wrote:
> > > > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> > > > 
> > > > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> > > > > iommu_unmap() and dumping tables in debugfs exclusive. This does not
> > > > > work because debugfs may depend on the DMA of the devices to work. It
> > > > > seems that what we can do is to allow this race, but when we traverse
> > > > > the page table in debugfs, we will check the validity of the physical
> > > > > address retrieved from the page table entry. Then, the worst case is to
> > > > > print some useless information.
> > > > 
> > > > Sounds horrible, don't you have locking around the IOPTEs of some
> > > > kind? How does updating them work reliably?
> > > 
> > > There's no locking around updating the IOPTEs. The basic assumption is
> > > that at any time, there's only a single thread manipulating the mappings
> > > of the range specified in iommu_map/unmap() APIs. Therefore, the race
> > > only exists when multiple ranges share some high-level IOPTEs. The IOMMU
> > > driver updates those IOPTEs using the compare-and-exchange atomic
> > > operation.
> > 
> > Oh? Did I miss where that was documented as part of the iommu API?
> > 
> > Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
> > 
> > iommufd goes out of its way to avoid this kind of serialization so
> > that userspace can parallel map IOVA.
> > 
> > I think if this is the requirement then the iommu API needs to
> > provide a lock around the domain for the driver..
> 
> Eww, no, we can't kill performance by forcing serialisation on the entire
> API just for one silly driver-internal debugfs corner :(

I'm not worried about debugfs, I'm worried about these efforts to
speed up VFIO VM booting by parallel domain loading:

https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jordan@oracle.com/

The DMA API should maintain its own external lock, but the general
domain interface to the rest of the kernel should be safe, IMHO.

Or at least it should be documented..
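
Something like a Context: line in the kernel-doc would capture it. This
is only proposed wording, nothing that exists today:

	/**
	 * iommu_map() - map a physically contiguous range into a domain
	 *
	 * Context: May be called concurrently on the same domain so long
	 *          as the IOVA ranges of the concurrent calls do not
	 *          overlap. Racing against iommu_domain_free() is a bug.
	 */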

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 13:52           ` Robin Murphy
@ 2022-05-31 15:59             ` Jason Gunthorpe
  2022-05-31 16:42               ` Robin Murphy
  2022-06-01  5:47               ` Baolu Lu
  2022-06-01  5:33             ` Baolu Lu
  1 sibling, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 15:59 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Baolu Lu, Kevin Tian, Ashok Raj, linux-kernel, Christoph Hellwig,
	iommu, Jacob jun Pan, Will Deacon

On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:

> > +                break;
> > +            pgtable_walk_level(m, phys_to_virt(phys_addr),
> 
> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
> gets you a valid struct page. Whether that page is direct-mapped kernel
> memory or not is a different matter.

Even though this is debugfs, if the operation is sketchy like that and
can theoretically crash the kernel, the driver should test capabilities,
CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
better cap for 'userspace may crash the kernel'
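
Something along these lines, as an untested sketch (the exact
capability to test is the open question):

	static int domain_translation_struct_show(struct seq_file *m,
						  void *unused)
	{
		/* Refuse the dump unless the caller may do raw I/O. */
		if (!capable(CAP_SYS_RAWIO))
			return -EPERM;

		return bus_for_each_dev(&pci_bus_type, NULL, m,
					show_device_domain_translation);
	}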

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 15:13                   ` Jason Gunthorpe
@ 2022-05-31 16:01                     ` Robin Murphy
  2022-05-31 16:21                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 16:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022-05-31 16:13, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote:
>> On 2022-05-31 15:53, Jason Gunthorpe wrote:
>>> On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote:
>>>> On 2022/5/31 21:10, Jason Gunthorpe wrote:
>>>>> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>>>>>
>>>>>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>>>>>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>>>>>> work because debugfs may depend on the DMA of the devices to work. It
>>>>>> seems that what we can do is to allow this race, but when we traverse
>>>>>> the page table in debugfs, we will check the validity of the physical
>>>>>> address retrieved from the page table entry. Then, the worst case is to
>>>>>> print some useless information.
>>>>>
>>>>> Sounds horrible, don't you have locking around the IOPTEs of some
>>>>> kind? How does updating them work reliably?
>>>>
>>>> There's no locking around updating the IOPTEs. The basic assumption is
>>>> that at any time, there's only a single thread manipulating the mappings
>>>> of the range specified in iommu_map/unmap() APIs. Therefore, the race
>>>> only exists when multiple ranges share some high-level IOPTEs. The IOMMU
>>>> driver updates those IOPTEs using the compare-and-exchange atomic
>>>> operation.
>>>
>>> Oh? Did I miss where that was documented as part of the iommu API?
>>>
>>> Daniel posted patches for VFIO to multi-thread iommu_domain mapping.
>>>
>>> iommufd goes out of its way to avoid this kind of serialization so
>>> that userspace can parallel map IOVA.
>>>
>>> I think if this is the requirement then the iommu API needs to
>>> provide a lock around the domain for the driver..
>>
>> Eww, no, we can't kill performance by forcing serialisation on the entire
>> API just for one silly driver-internal debugfs corner :(
> 
> I'm not worried about debugfs, I'm worried about these efforts to
> speed up VFIO VM booting by parallel domain loading:
> 
> https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jordan@oracle.com/
> 
> The DMA API should maintain its own external lock, but the general
> domain interface to the rest of the kernel should be safe, IMHO.

The DMA API doesn't need locking, partly since it can trust itself not 
to do stupid things, and mostly because it's DMA API performance that's 
fundamentally incompatible with serialisation anyway. Why do you think 
we have a complicated per-CPU IOVA caching mechanism, if not to support 
big multi-queue devices with multiple CPU threads mapping/unmapping in 
different parts of the same DMA domain concurrently?
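
The pattern at issue, schematically - this is a fragment, with iovad,
shift, nr_pages and paddr standing in for the caller's state:

	/*
	 * Per-CPU caching keeps the IOVA allocations from serialising;
	 * each thread then maps its own disjoint range with no further
	 * locking.
	 */
	unsigned long pfn = alloc_iova_fast(iovad, nr_pages, limit_pfn, true);

	if (pfn)
		iommu_map(domain, (dma_addr_t)pfn << shift, paddr,
			  nr_pages << shift, IOMMU_READ | IOMMU_WRITE);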

> Or at least it should be documented..

As far as I'm aware there's never been any restriction at the IOMMU API 
level. It should be self-evident that racing concurrent 
iommu_{map,unmap} requests against iommu_domain_free(), or against each 
other for overlapping IOVAs, is a bad idea and don't do that if you want 
a well-defined outcome. The simpler drivers already serialise on a 
per-domain lock internally, while the more performance-focused ones 
implement lock-free atomic pagetable management in a similar style to 
CPU arch code; either way it should work fine as-is. The difference with 
debugfs is that it's a completely orthogonal side-channel - an 
iommu_domain user like VFIO or iommu-dma can make sure its *own* API 
usage is sane, but can't be aware of the user triggering some 
driver-internal introspection of that domain in a manner that could race 
more harmfully. It has to be down to individual drivers to make that 
safe if they choose to expose such an interface.

Robin.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 16:01                     ` Robin Murphy
@ 2022-05-31 16:21                       ` Jason Gunthorpe
  2022-05-31 18:07                         ` Robin Murphy
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 16:21 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:

> The DMA API doesn't need locking, partly since it can trust itself not to do
> stupid things, and mostly because it's DMA API performance that's
> fundamentally incompatible with serialisation anyway. Why do you think we
> have a complicated per-CPU IOVA caching mechanism, if not to support big
> multi-queue devices with multiple CPU threads mapping/unmapping in different
> parts of the same DMA domain concurrently?

Well, per-CPU is a form of locking.

So what are the actual locking rules here? We can call map/unmap
concurrently but not if ... ?

IOVA overlaps?

And we expect the iommu driver to be unable to free page table levels
that have IOVA boundaries in them?

> The simpler drivers already serialise on a per-domain lock internally, while
> the more performance-focused ones implement lock-free atomic pagetable
> management in a similar style to CPU arch code; either way it should work
> fine as-is.

The mm has page table locks at every level and generally expects them
to be held for a lot of manipulations. There are some lockless cases,
but it is not as aggressive as this sounds.

> The difference with debugfs is that it's a completely orthogonal
> side-channel - an iommu_domain user like VFIO or iommu-dma can make sure its
> *own* API usage is sane, but can't be aware of the user triggering some
> driver-internal introspection of that domain in a manner that could race
> more harmfully. 

The mm solution to this problem is to RCU free the page table
levels. This way something like debugfs can read a page table under
RCU completely safely, though incoherently, and there is no
performance cost on the map/unmap fast path side.

Today struct page has a rcu_head that can be used to rcu free it, so
it costs nothing.
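
A rough sketch of the idea (helper names invented; the walker side
would run under rcu_read_lock() for the whole walk):

	static void pgtable_free_rcu(struct rcu_head *head)
	{
		/* The rcu_head lives in the struct page of the table. */
		__free_page(container_of(head, struct page, rcu_head));
	}

	static void free_pgtable_page_deferred(void *table)
	{
		/* Defer the free until concurrent RCU walkers are done. */
		call_rcu(&virt_to_page(table)->rcu_head, pgtable_free_rcu);
	}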

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 15:59             ` Jason Gunthorpe
@ 2022-05-31 16:42               ` Robin Murphy
  2022-06-01  5:47               ` Baolu Lu
  1 sibling, 0 replies; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Kevin Tian, Ashok Raj, linux-kernel, Christoph Hellwig,
	iommu, Jacob jun Pan, Will Deacon

On 2022-05-31 16:59, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:
> 
>>> +                break;
>>> +            pgtable_walk_level(m, phys_to_virt(phys_addr),
>>
>> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
>> gets you a valid struct page. Whether that page is direct-mapped kernel
>> memory or not is a different matter.
> 
> Even though this is debugfs, if the operation is sketchy like that and
> can theoretically crash the kernel, the driver should test capabilities,
> CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
> better cap for 'userspace may crash the kernel'

It shouldn't be insurmountable to make this safe, it just needs a bit 
more than pfn_valid(), which can still return true off the ends of the 
memory map if they're not perfectly section-aligned, and for random 
reserved holes in the middle.
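
i.e. something stricter than the bare pfn_valid() test, perhaps along
these lines (page_is_ram() is a real helper; whether it is the right
test here is exactly the question):

	static bool pte_level_dumpable(u64 phys_addr)
	{
		unsigned long pfn = __phys_to_pfn(phys_addr);

		/* Require a valid pfn that is actually mapped RAM. */
		return pfn_valid(pfn) && page_is_ram(pfn);
	}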

Robin.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 16:21                       ` Jason Gunthorpe
@ 2022-05-31 18:07                         ` Robin Murphy
  2022-05-31 18:51                           ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 18:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022-05-31 17:21, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote:
> 
>> The DMA API doesn't need locking, partly since it can trust itself not to do
>> stupid things, and mostly because it's DMA API performance that's
>> fundamentally incompatible with serialisation anyway. Why do you think we
>> have a complicated per-CPU IOVA caching mechanism, if not to support big
>> multi-queue devices with multiple CPU threads mapping/unmapping in different
>> parts of the same DMA domain concurrently?
> 
> Well, per-CPU is a form of locking.

Right, but that only applies for alloc_iova_fast() itself - once the 
CPUs have each got their distinct IOVA region, they can then pile into 
iommu_map() completely unhindered (and the inverse for the unmap path).

> So what are the actual locking rules here? We can call map/unmap
> concurrently but not if ... ?
> 
> IOVA overlaps?

Well, I think the de-facto rule is that you technically *can* make 
overlapping requests, but one or both may fail, and the final outcome in 
terms of what ends up mapped in the domain is undefined (especially if 
they both succeed). The only real benefit of enforcing serialisation 
would be better failure behaviour in such cases, but it remains 
fundamentally nonsensical for callers to make contradictory requests 
anyway, whether concurrently or sequentially, so there doesn't seem much 
point in spending effort on improving support for nonsense.

> And we expect the iommu driver to be unable to free page table levels
> that have IOVA boundaries in them?

I'm not entirely sure what you mean there, but in general an unmap 
request is expected to match some previous map request - there isn't a 
defined API-level behaviour for partial unmaps. They might either unmap 
the entire region originally mapped, or just the requested part, or 
might fail entirely (IIRC there was some nasty code in VFIO for 
detecting a particular behaviour). Similarly for unmapping anything 
that's already not mapped, some drivers treat that as a no-op, others as 
an error. But again, this is even further unrelated to concurrency.

>> The simpler drivers already serialise on a per-domain lock internally, while
>> the more performance-focused ones implement lock-free atomic pagetable
>> management in a similar style to CPU arch code; either way it should work
>> fine as-is.
> 
> The mm has page table locks at every level and generally expects them
> to be held for a lot of manipulations. There are some lockless cases,
> but it is not as aggressive as this sounds.

Oh, I've spent the last couple of weeks hacking up horrible things 
manipulating entries in init_mm, and never realised that that was 
actually the special case. Oh well, live and learn.

Robin.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 18:07                         ` Robin Murphy
@ 2022-05-31 18:51                           ` Jason Gunthorpe
  2022-05-31 21:22                             ` Robin Murphy
  2022-06-01  6:39                             ` Baolu Lu
  0 siblings, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 18:51 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:

> > And we expect the iommu driver to be unable to free page table levels
> > that have IOVA boundaries in them?
> 
> I'm not entirely sure what you mean there, but in general an unmap request
> is expected to match some previous map request 

atomic cmpxchg is OK for inserting new page table levels but it can't
protect you against concurrent freeing of page table levels. So
without locks it means that page tables can't usually be freed. Which
seems to match what the Intel driver does - at least from a cursory
look.

This is one of the reasons the mm has the mmap/etc lock and spinlocks
because we do expect page table levels to get wiped out when VMAs are
zap'd - all the different locks provide the protection against page
tables disappearing out from under something manipulating them.

Basically every "lockless" walk in (process) MM land is actually
protected by some kind of lock that blocks zap_page_range() from
removing the page table levels themselves.
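
An RCU-protected walk of an IOMMU pagetable would look roughly like
this (a sketch with illustrative helper names, assuming table pages are
only freed after a grace period):

	rcu_read_lock();
	pte = READ_ONCE(parent[offset]);
	if (pte_present(pte))
		/* the grace period keeps the child table alive */
		walk_level(phys_to_virt(pte_to_phys(pte)), level - 1);
	rcu_read_unlock();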

> They might either unmap the entire region originally mapped, or just
> the requested part, or might fail entirely (IIRC there was some
> nasty code in VFIO for detecting a particular behaviour).

This is something I did differently in iommufd. It always generates
unmaps that are strict supersets of the maps it issued. So it would be
a kernel bug if the driver unmaps more or less than requested.

> Oh, I've spent the last couple of weeks hacking up horrible things
> manipulating entries in init_mm, and never realised that that was actually
> the special case. Oh well, live and learn.

The init_mm is sort of different, it doesn't have zap in quite the
same way, for example. I was talking about the typical process mm.

Anyhow, the right solution is to use RCU as I described before, Baolu
do you want to try?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 18:51                           ` Jason Gunthorpe
@ 2022-05-31 21:22                             ` Robin Murphy
  2022-05-31 23:10                               ` Jason Gunthorpe
  2022-06-01  6:39                             ` Baolu Lu
  1 sibling, 1 reply; 55+ messages in thread
From: Robin Murphy @ 2022-05-31 21:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022-05-31 19:51, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote:
> 
>>> And we expect the iommu driver to be unable to free page table levels
>>> that have IOVA boundaries in them?
>>
>> I'm not entirely sure what you mean there, but in general an unmap request
>> is expected to match some previous map request
> 
> atomic cmpxchg is OK for inserting new page table levels but it can't
> protect you against concurrent freeing of page table levels. So
> without locks it means that page tables can't usually be freed. Which
> seems to match what the Intel driver does - at least from a cursory
> look.
> 
> This is one of the reasons the mm has the mmap/etc lock and spinlocks
> because we do expect page table levels to get wiped out when VMAs are
> zap'd - all the different locks provide the protection against page
> tables disappearing out from under something manipulating them.
> 
> Basically every "lockless" walk in (process) MM land is actually
> protected by some kind of lock that blocks zap_page_range() from
> removing the page table levels themselves.

I'm not an expert in the Intel or AMD code, so I can only speak with 
confidence about what we do in io-pgtable-arm, but the main reason for 
not freeing pagetables is that it's simply not worth the bother of 
trying to work out whether a whole sub-tree is empty. Not to mention 
whether it's *still* empty by the time that we may have figured out that 
it was.

There are only 3 instances where we'll free a table while the domain is 
live. The first is the one legitimate race condition, where two map 
requests targeting relatively nearby PTEs both go to fill in an 
intermediate level of table; whoever loses that race frees the table 
they allocated, but it was never visible to anyone else so that's 
definitely fine. The second is if we're mapping a block entry, and find 
that there's already a table entry there, wherein we assume the table 
must be empty, clear the entry, invalidate any walk caches, install the 
block entry, then free the orphaned table; since we're mapping the 
entire IOVA range covered by that table, there should be no other 
operations on that IOVA range attempting to walk the table at the same 
time, so it's fine. The third is effectively the inverse, if we get a 
block-sized unmap but find a table entry rather than a block at that 
point (on the assumption that it's de-facto allowed for a single unmap 
to cover multiple adjacent mappings as long as it does so exactly); 
similarly we assume that the table must be full, and no other operations 
should be racing because we're unmapping its whole IOVA range, so we 
remove the table entry, invalidate, and free as before.
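
The first of those, in rough code (sketch, names illustrative):

	new_tbl = alloc_table_page(gfp);
	old = cmpxchg64_relaxed(ptep, 0, table_entry(new_tbl));
	if (old) {
		/* Lost the race; our table was never visible to
		 * anyone else, so an immediate free is safe. */
		free_table_page(new_tbl);
	}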

Again for efficiency reasons we don't attempt to validate those 
assumptions by inspecting the freed tables, so odd behaviour can fall 
out if the caller *does* do something bogus. For example if two calls 
race to map a block and a page in the same (unmapped) region, the block 
mapping will always succeed (and be what ends up in the final pagetable 
state), but the page mapping may or may not report failure depending on 
the exact timing.

Although we don't have debug dumping for io-pgtable-arm, it's good to be 
thinking about this, since it's made me realise that dirty-tracking 
sweeps per that proposal might pose a similar kind of concern, so we 
might still need to harden these corners for the sake of that. Which 
also reminds me that somewhere I have some half-finished patches making 
io-pgtable-arm use the iommu_iotlb_gather freelist, so maybe I'll tackle 
both concerns at once (perhaps we might even be able to RCU-ify the 
freelist generically? I'll see how it goes when I get there).

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 21:22                             ` Robin Murphy
@ 2022-05-31 23:10                               ` Jason Gunthorpe
  2022-06-01  8:53                                 ` Tian, Kevin
  2022-06-01 12:18                                 ` Joao Martins
  0 siblings, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 23:10 UTC (permalink / raw)
  To: Robin Murphy, Joao Martins
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:

> There are only 3 instances where we'll free a table while the domain is
> live. The first is the one legitimate race condition, where two map requests
> targeting relatively nearby PTEs both go to fill in an intermediate level of
> table; whoever loses that race frees the table they allocated, but it was
> never visible to anyone else so that's definitely fine. The second is if
> we're mapping a block entry, and find that there's already a table entry
> there, wherein we assume the table must be empty, clear the entry,
> invalidate any walk caches, install the block entry, then free the orphaned
> table; since we're mapping the entire IOVA range covered by that table,
> there should be no other operations on that IOVA range attempting to walk
> the table at the same time, so it's fine. 

I saw these two in the Intel driver

> The third is effectively the inverse, if we get a block-sized unmap
> but find a table entry rather than a block at that point (on the
> assumption that it's de-facto allowed for a single unmap to cover
> multiple adjacent mappings as long as it does so exactly); similarly
> we assume that the table must be full, and no other operations
> should be racing because we're unmapping its whole IOVA range, so we
> remove the table entry, invalidate, and free as before.

Not sure I noticed this one though

This all makes sense though.

> Although we don't have debug dumping for io-pgtable-arm, it's good to be
> thinking about this, since it's made me realise that dirty-tracking sweeps
> per that proposal might pose a similar kind of concern, so we might still
> need to harden these corners for the sake of that.

Lets make sure Joao sees this..

It is interesting because we probably don't want the big latency
spikes that would come from using locking to block map/unmap while
dirty reading is happening - e.g. at the iommufd level.

From a consistency POV the only case that matters is unmap and unmap
should already be doing a dedicated dirty read directly prior to the
unmap (as per that other long thread)

So having safe racy reading in the kernel is probably best, and so RCU
would be good here too.

> that somewhere I have some half-finished patches making io-pgtable-arm use
> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
> how it goes when I get there).

This is all very similar to how the mm's tlb gather stuff works,
especially on PPC which requires RCU protection of page tables instead
of the IPI trick.

Currently the rcu_head and the lru overlap in the struct page. To do
this I'd suggest following what was done for slab - see ca1a46d6f506
("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
like 'struct slab' for use with iommu page tables and put the
list_head and rcu_head sequentially in the struct iommu_pt_page.

Continue to use put_pages_list() just with the list_head in the new
struct not the lru.
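
i.e. something like this (rough sketch; the exact field placement would
need the same careful overlay against struct page that 'struct slab'
does):

	struct iommu_pt_page {
		unsigned long flags;		/* overlays page->flags */
		struct list_head free_list;	/* for put_pages_list() */
		struct rcu_head rcu;		/* deferred table free */
	};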

If we need it for dirty tracking then it makes sense to start
progressing parts at least..

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 13:52           ` Robin Murphy
  2022-05-31 15:59             ` Jason Gunthorpe
@ 2022-06-01  5:33             ` Baolu Lu
  1 sibling, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-01  5:33 UTC (permalink / raw)
  To: Robin Murphy, Jason Gunthorpe
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

Hi Robin,

Thank you for the comments.

On 2022/5/31 21:52, Robin Murphy wrote:
> On 2022-05-31 04:02, Baolu Lu wrote:
>> On 2022/5/30 20:14, Jason Gunthorpe wrote:
>>> On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:

[--snip--]

>> diff --git a/drivers/iommu/intel/debugfs.c 
>> b/drivers/iommu/intel/debugfs.c
>> index d927ef10641b..e6f4835b8d9f 100644
>> --- a/drivers/iommu/intel/debugfs.c
>> +++ b/drivers/iommu/intel/debugfs.c
>> @@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file 
>> *m, struct dma_pte *pde,
>>               continue;
>>
>>           path[level] = pde->val;
>> -        if (dma_pte_superpage(pde) || level == 1)
>> +        if (dma_pte_superpage(pde) || level == 1) {
>>               dump_page_info(m, start, path);
>> -        else
>> -            pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
>> +        } else {
>> +            unsigned long phys_addr;
>> +
>> +            phys_addr = (unsigned long)dma_pte_addr(pde);
>> +            if (!pfn_valid(__phys_to_pfn(phys_addr)))
> 
> Given that pte_present(pde) passed just above, it was almost certainly a 
> valid entry, so it seems unlikely that the physical address it pointed 
> to could have disappeared in the meantime. If you're worried about the 
> potential case where we've been preempted during this walk for long 
> enough that the page has already been freed by an unmap, reallocated, 

Yes. This is exactly what I am worried about and what this patch wants
to solve.

> and filled with someone else's data that happens to look like valid 
> PTEs, this still isn't enough, since that data could just as well happen 
> to look like valid physical addresses too.
> I imagine that if you want to safely walk pagetables concurrently with 
> them potentially being freed, you'd probably need to get RCU involved.

I don't want to make the map/unmap interfaces more complex or
inefficient for the sake of a debugfs feature. I hope the debugfs and
map/unmap interfaces can stay orthogonal, just like the IOMMU hardware
traversing the page tables: as long as the physical address being
accessed is valid and accessible, continue; otherwise, stop the
traversal immediately. If we can't achieve this, I'd rather stop
supporting this debugfs node.

> 
>> +                break;
>> +            pgtable_walk_level(m, phys_to_virt(phys_addr),
> 
> Also, obligatory reminder that pfn_valid() only means that pfn_to_page() 
> gets you a valid struct page. Whether that page is direct-mapped kernel 
> memory or not is a different matter.

Perhaps I can check this from the page flags?

> 
>>                          level - 1, start, path);
>> +        }
>>           path[level] = 0;
>>       }
>>   }
>>
>> -static int show_device_domain_translation(struct device *dev, void 
>> *data)
>> +static int __show_device_domain_translation(struct device *dev, void 
>> *data)
>>   {
>>       struct device_domain_info *info = dev_iommu_priv_get(dev);
>>       struct dmar_domain *domain = info->domain;
>>       struct seq_file *m = data;
>>       u64 path[6] = { 0 };
>>
>> -    if (!domain)
>> -        return 0;
>> -
>>       seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
>>              (u64)virt_to_phys(domain->pgd));
>>       seq_puts(m, 
>> "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
>> @@ -359,20 +362,27 @@ static int show_device_domain_translation(struct 
>> device *dev, void *data)
>>       pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
>>       seq_putc(m, '\n');
>>
>> -    return 0;
>> +    return 1;
>>   }
>>
>> -static int domain_translation_struct_show(struct seq_file *m, void 
>> *unused)
>> +static int show_device_domain_translation(struct device *dev, void 
>> *data)
>>   {
>> -    unsigned long flags;
>> -    int ret;
>> +    struct iommu_group *group;
>>
>> -    spin_lock_irqsave(&device_domain_lock, flags);
>> -    ret = bus_for_each_dev(&pci_bus_type, NULL, m,
>> -                   show_device_domain_translation);
>> -    spin_unlock_irqrestore(&device_domain_lock, flags);
>> +    group = iommu_group_get(dev);
>> +    if (group) {
>> +        iommu_group_for_each_dev(group, data,
>> +                     __show_device_domain_translation);
> 
> Why group_for_each_dev?

This holds the group mutex while the callback is invoked. With the
group mutex held, the domain cannot be changed.

> If there *are* multiple devices in the group 
> then by definition they should be attached to the same domain, so 
> dumping that domain's mappings more than once seems pointless. 
> Especially given that the outer bus_for_each_dev iteration will already 
> visit each individual device anyway, so this would only make the 
> redundancy even worse than it already is.

__show_device_domain_translation() only dumps mappings once as it always
returns 1.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 15:59             ` Jason Gunthorpe
  2022-05-31 16:42               ` Robin Murphy
@ 2022-06-01  5:47               ` Baolu Lu
  1 sibling, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-01  5:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy
  Cc: baolu.lu, Kevin Tian, Ashok Raj, linux-kernel, Christoph Hellwig,
	iommu, Jacob jun Pan, Will Deacon

On 2022/5/31 23:59, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:
> 
>>> +                break;
>>> +            pgtable_walk_level(m, phys_to_virt(phys_addr),
>>
>> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
>> gets you a valid struct page. Whether that page is direct-mapped kernel
>> memory or not is a different matter.
> 
> Even though this is debugfs, if the operation is sketchy like that and
> can theoretically crash the kernel, the driver should test capabilities;
> CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
> better cap for 'userspace may crash the kernel'.

Yes. We should test both CAP_SYS_ADMIN and CAP_SYS_RAWIO.
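
Something like this at the top of the debugfs dump path (sketch):

	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
		return -EPERM;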

Best regards,
baolu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 18:51                           ` Jason Gunthorpe
  2022-05-31 21:22                             ` Robin Murphy
@ 2022-06-01  6:39                             ` Baolu Lu
  1 sibling, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-01  6:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy
  Cc: baolu.lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 2022/6/1 02:51, Jason Gunthorpe wrote:
>> Oh, I've spent the last couple of weeks hacking up horrible things
>> manipulating entries in init_mm, and never realised that that was actually
>> the special case. Oh well, live and learn.
> The init_mm is sort of different, it doesn't have zap in quite the
> same way, for example. I was talking about the typical process mm.
> 
> Anyhow, the right solution is to use RCU as I described before, Baolu
> do you want to try?

Yes, of course.

Your discussion with Robin gave me a lot of inspiration. Much
appreciated! I'd like to solve this debugfs problem in a separate
patch, since it has grown beyond the original intent of this series.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 23:10                               ` Jason Gunthorpe
@ 2022-06-01  8:53                                 ` Tian, Kevin
  2022-06-01 12:18                                 ` Joao Martins
  1 sibling, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  8:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy, Martins, Joao
  Cc: Baolu Lu, Joerg Roedel, Raj, Ashok, Christoph Hellwig,
	Will Deacon, Liu, Yi L, Pan, Jacob jun, iommu, linux-kernel

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 1, 2022 7:11 AM
> 
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
> 
> > There are only 3 instances where we'll free a table while the domain is
> > live. The first is the one legitimate race condition, where two map requests
> > targeting relatively nearby PTEs both go to fill in an intermediate level of
> > table; whoever loses that race frees the table they allocated, but it was
> > never visible to anyone else so that's definitely fine. The second is if
> > we're mapping a block entry, and find that there's already a table entry
> > there, wherein we assume the table must be empty, clear the entry,
> > invalidate any walk caches, install the block entry, then free the orphaned
> > table; since we're mapping the entire IOVA range covered by that table,
> > there should be no other operations on that IOVA range attempting to walk
> > the table at the same time, so it's fine.
> 
> I saw these two in the Intel driver
> 
> > The third is effectively the inverse, if we get a block-sized unmap
> > but find a table entry rather than a block at that point (on the
> > assumption that it's de-facto allowed for a single unmap to cover
> > multiple adjacent mappings as long as it does so exactly); similarly
> > we assume that the table must be full, and no other operations
> > should be racing because we're unmapping its whole IOVA range, so we
> > remove the table entry, invalidate, and free as before.
> 
> Not sure I noticed this one though
> 
> This all makes sense though.

The Intel driver also does this. See dma_pte_clear_level():

		/* If range covers entire pagetable, free it */
		if (start_pfn <= level_pfn &&
		     last_pfn >= level_pfn + level_size(level) - 1) {
			...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain()
  2022-05-27  6:30 ` [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain() Lu Baolu
  2022-05-27 15:00   ` Jason Gunthorpe
@ 2022-06-01  8:53   ` Tian, Kevin
  1 sibling, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  8:53 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> The per-device device_domain_info data could be retrieved from the
> device itself. There's no need to search a global list.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
>  drivers/iommu/intel/iommu.h |  2 --
>  drivers/iommu/intel/iommu.c | 25 -------------------------
>  drivers/iommu/intel/pasid.c | 37 +++++++++++--------------------------
>  3 files changed, 11 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 8a6d64d726c0..2f4a5b9509c0 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -727,8 +727,6 @@ extern int dmar_ir_support(void);
>  void *alloc_pgtable_page(int node);
>  void free_pgtable_page(void *vaddr);
>  struct intel_iommu *domain_get_iommu(struct dmar_domain *domain);
> -int for_each_device_domain(int (*fn)(struct device_domain_info *info,
> -				     void *data), void *data);
>  void iommu_flush_write_buffer(struct intel_iommu *iommu);
>  int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device
> *dev);
>  struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8
> *devfn);
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index cacae8bdaa65..6549b09d7f32 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -316,31 +316,6 @@ static int iommu_skip_te_disable;
> 
>  static DEFINE_SPINLOCK(device_domain_lock);
>  static LIST_HEAD(device_domain_list);
> -
> -/*
> - * Iterate over elements in device_domain_list and call the specified
> - * callback @fn against each element.
> - */
> -int for_each_device_domain(int (*fn)(struct device_domain_info *info,
> -				     void *data), void *data)
> -{
> -	int ret = 0;
> -	unsigned long flags;
> -	struct device_domain_info *info;
> -
> -	spin_lock_irqsave(&device_domain_lock, flags);
> -	list_for_each_entry(info, &device_domain_list, global) {
> -		ret = fn(info, data);
> -		if (ret) {
> -			spin_unlock_irqrestore(&device_domain_lock, flags);
> -			return ret;
> -		}
> -	}
> -	spin_unlock_irqrestore(&device_domain_lock, flags);
> -
> -	return 0;
> -}
> -
>  const struct iommu_ops intel_iommu_ops;
> 
>  static bool translation_pre_enabled(struct intel_iommu *iommu)
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index b2ac5869286f..0627d6465f25 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -103,36 +103,20 @@ device_detach_pasid_table(struct
> device_domain_info *info,
>  }
> 
>  struct pasid_table_opaque {
> -	struct pasid_table	**pasid_table;
> -	int			segment;
> -	int			bus;
> -	int			devfn;
> +	struct pasid_table	*pasid_table;
>  };
> 
> -static int search_pasid_table(struct device_domain_info *info, void *opaque)
> -{
> -	struct pasid_table_opaque *data = opaque;
> -
> -	if (info->iommu->segment == data->segment &&
> -	    info->bus == data->bus &&
> -	    info->devfn == data->devfn &&
> -	    info->pasid_table) {
> -		*data->pasid_table = info->pasid_table;
> -		return 1;
> -	}
> -
> -	return 0;
> -}
> -
>  static int get_alias_pasid_table(struct pci_dev *pdev, u16 alias, void
> *opaque)
>  {
>  	struct pasid_table_opaque *data = opaque;
> +	struct device_domain_info *info;
> 
> -	data->segment = pci_domain_nr(pdev->bus);
> -	data->bus = PCI_BUS_NUM(alias);
> -	data->devfn = alias & 0xff;
> +	info = dev_iommu_priv_get(&pdev->dev);
> +	if (!info || !info->pasid_table)
> +		return 0;
> 
> -	return for_each_device_domain(&search_pasid_table, data);
> +	data->pasid_table = info->pasid_table;
> +	return 1;
>  }
> 
>  /*
> @@ -141,9 +125,9 @@ static int get_alias_pasid_table(struct pci_dev *pdev,
> u16 alias, void *opaque)
>   */
>  int intel_pasid_alloc_table(struct device *dev)
>  {
> +	struct pasid_table_opaque data = { NULL };
>  	struct device_domain_info *info;
>  	struct pasid_table *pasid_table;
> -	struct pasid_table_opaque data;
>  	struct page *pages;
>  	u32 max_pasid = 0;
>  	int ret, order;
> @@ -155,11 +139,12 @@ int intel_pasid_alloc_table(struct device *dev)
>  		return -EINVAL;
> 
>  	/* DMA alias device already has a pasid table, use it: */
> -	data.pasid_table = &pasid_table;
>  	ret = pci_for_each_dma_alias(to_pci_dev(dev),
>  				     &get_alias_pasid_table, &data);
> -	if (ret)
> +	if (ret) {
> +		pasid_table = data.pasid_table;
>  		goto attach_out;
> +	}
> 
>  	pasid_table = kzalloc(sizeof(*pasid_table), GFP_KERNEL);
>  	if (!pasid_table)
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk()
  2022-05-27  6:30 ` [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk() Lu Baolu
  2022-05-27 15:01   ` Jason Gunthorpe
@ 2022-06-01  8:56   ` Tian, Kevin
  1 sibling, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  8:56 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> Use pci_get_domain_bus_and_slot() instead of searching the global list
> to retrieve the pci device pointer. This removes device_domain_list
> global list as there are no consumers anymore.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
>  drivers/iommu/intel/iommu.h |  1 -
>  drivers/iommu/intel/iommu.c | 33 ++++++---------------------------
>  2 files changed, 6 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> index 2f4a5b9509c0..6724703d573b 100644
> --- a/drivers/iommu/intel/iommu.h
> +++ b/drivers/iommu/intel/iommu.h
> @@ -609,7 +609,6 @@ struct intel_iommu {
>  /* PCI domain-device relationship */
>  struct device_domain_info {
>  	struct list_head link;	/* link to domain siblings */
> -	struct list_head global; /* link to global list */
>  	struct list_head table;	/* link to pasid table */
>  	u32 segment;		/* PCI segment number */
>  	u8 bus;			/* PCI bus number */
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 25d4c5200526..bbdd3417a1b1 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -131,8 +131,6 @@ static struct intel_iommu **g_iommus;
> 
>  static void __init check_tylersburg_isoch(void);
>  static int rwbf_quirk;
> -static inline struct device_domain_info *
> -dmar_search_domain_by_dev_info(int segment, int bus, int devfn);
> 
>  /*
>   * set to 1 to panic kernel if can't successfully enable VT-d
> @@ -315,7 +313,6 @@ static int iommu_skip_te_disable;
>  #define IDENTMAP_AZALIA		4
> 
>  static DEFINE_SPINLOCK(device_domain_lock);
> -static LIST_HEAD(device_domain_list);
>  const struct iommu_ops intel_iommu_ops;
> 
>  static bool translation_pre_enabled(struct intel_iommu *iommu)
> @@ -845,9 +842,14 @@ static void pgtable_walk(struct intel_iommu
> *iommu, unsigned long pfn, u8 bus, u
>  	struct device_domain_info *info;
>  	struct dma_pte *parent, *pte;
>  	struct dmar_domain *domain;
> +	struct pci_dev *pdev;
>  	int offset, level;
> 
> -	info = dmar_search_domain_by_dev_info(iommu->segment, bus,
> devfn);
> +	pdev = pci_get_domain_bus_and_slot(iommu->segment, bus, devfn);
> +	if (!pdev)
> +		return;
> +
> +	info = dev_iommu_priv_get(&pdev->dev);
>  	if (!info || !info->domain) {
>  		pr_info("device [%02x:%02x.%d] not probed\n",
>  			bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
> @@ -2348,19 +2350,6 @@ static void domain_remove_dev_info(struct
> dmar_domain *domain)
>  	spin_unlock_irqrestore(&device_domain_lock, flags);
>  }
> 
> -static inline struct device_domain_info *
> -dmar_search_domain_by_dev_info(int segment, int bus, int devfn)
> -{
> -	struct device_domain_info *info;
> -
> -	list_for_each_entry(info, &device_domain_list, global)
> -		if (info->segment == segment && info->bus == bus &&
> -		    info->devfn == devfn)
> -			return info;
> -
> -	return NULL;
> -}
> -
>  static int domain_setup_first_level(struct intel_iommu *iommu,
>  				    struct dmar_domain *domain,
>  				    struct device *dev,
> @@ -4564,7 +4553,6 @@ static struct iommu_device
> *intel_iommu_probe_device(struct device *dev)
>  	struct pci_dev *pdev = dev_is_pci(dev) ? to_pci_dev(dev) : NULL;
>  	struct device_domain_info *info;
>  	struct intel_iommu *iommu;
> -	unsigned long flags;
>  	u8 bus, devfn;
> 
>  	iommu = device_to_iommu(dev, &bus, &devfn);
> @@ -4607,10 +4595,7 @@ static struct iommu_device
> *intel_iommu_probe_device(struct device *dev)
>  		}
>  	}
> 
> -	spin_lock_irqsave(&device_domain_lock, flags);
> -	list_add(&info->global, &device_domain_list);
>  	dev_iommu_priv_set(dev, info);
> -	spin_unlock_irqrestore(&device_domain_lock, flags);
> 
>  	return &iommu->iommu;
>  }
> @@ -4618,15 +4603,9 @@ static struct iommu_device
> *intel_iommu_probe_device(struct device *dev)
>  static void intel_iommu_release_device(struct device *dev)
>  {
>  	struct device_domain_info *info = dev_iommu_priv_get(dev);
> -	unsigned long flags;
> 
>  	dmar_remove_one_dev_info(dev);
> -
> -	spin_lock_irqsave(&device_domain_lock, flags);
>  	dev_iommu_priv_set(dev, NULL);
> -	list_del(&info->global);
> -	spin_unlock_irqrestore(&device_domain_lock, flags);
> -
>  	kfree(info);
>  	set_dma_ops(dev, NULL);
>  }
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 05/12] iommu/vt-d: Unncessary spinlock for root table alloc and free
  2022-05-27  6:30 ` [PATCH 05/12] iommu/vt-d: Unncessary spinlock for root table alloc and free Lu Baolu
@ 2022-06-01  9:05   ` Tian, Kevin
  0 siblings, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  9:05 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> The IOMMU root table is allocated and freed in the IOMMU initialization
> code in static boot or hot-plug paths. There's no need for a spinlock.

s/hot-plug/hot-remove/

> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
>  drivers/iommu/intel/iommu.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index bbdd3417a1b1..2d5f02b85de8 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -809,14 +809,12 @@ static int device_context_mapped(struct
> intel_iommu *iommu, u8 bus, u8 devfn)
> 
>  static void free_context_table(struct intel_iommu *iommu)
>  {
> -	int i;
> -	unsigned long flags;
>  	struct context_entry *context;
> +	int i;
> +
> +	if (!iommu->root_entry)
> +		return;
> 
> -	spin_lock_irqsave(&iommu->lock, flags);
> -	if (!iommu->root_entry) {
> -		goto out;
> -	}
>  	for (i = 0; i < ROOT_ENTRY_NR; i++) {
>  		context = iommu_context_addr(iommu, i, 0, 0);
>  		if (context)
> @@ -828,12 +826,10 @@ static void free_context_table(struct intel_iommu
> *iommu)
>  		context = iommu_context_addr(iommu, i, 0x80, 0);
>  		if (context)
>  			free_pgtable_page(context);
> -
>  	}
> +
>  	free_pgtable_page(iommu->root_entry);
>  	iommu->root_entry = NULL;
> -out:
> -	spin_unlock_irqrestore(&iommu->lock, flags);
>  }
> 
>  #ifdef CONFIG_DMAR_DEBUG
> @@ -1232,7 +1228,6 @@ static void domain_unmap(struct dmar_domain
> *domain, unsigned long start_pfn,
>  static int iommu_alloc_root_entry(struct intel_iommu *iommu)
>  {
>  	struct root_entry *root;
> -	unsigned long flags;
> 
>  	root = (struct root_entry *)alloc_pgtable_page(iommu->node);
>  	if (!root) {
> @@ -1242,10 +1237,7 @@ static int iommu_alloc_root_entry(struct
> intel_iommu *iommu)
>  	}
> 
>  	__iommu_flush_cache(iommu, root, ROOT_SIZE);
> -
> -	spin_lock_irqsave(&iommu->lock, flags);
>  	iommu->root_entry = root;
> -	spin_unlock_irqrestore(&iommu->lock, flags);
> 
>  	return 0;
>  }
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers
  2022-05-27  6:30 ` [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers Lu Baolu
@ 2022-06-01  9:09   ` Tian, Kevin
  2022-06-01 10:38     ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  9:09 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> The iommu->lock is used to protect the per-IOMMU domain ID resource.
> Move the spinlock acquisition/release into the helpers where domain
> IDs are allocated and freed. The device_domain_lock is irrelevant to
> domain ID resources, remove its assertion as well.

While moving the lock you also replace spin_lock_irqsave() with
spin_lock(). It'd be cleaner to just do the movement here and then
replace all the _irqsave() variants in patch 8.

> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel/iommu.c | 25 +++++++++----------------
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 2d5f02b85de8..0da937ce0534 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -1774,16 +1774,13 @@ static struct dmar_domain
> *alloc_domain(unsigned int type)
>  	return domain;
>  }
> 
> -/* Must be called with iommu->lock */
>  static int domain_attach_iommu(struct dmar_domain *domain,
>  			       struct intel_iommu *iommu)
>  {
>  	unsigned long ndomains;
> -	int num;
> -
> -	assert_spin_locked(&device_domain_lock);
> -	assert_spin_locked(&iommu->lock);
> +	int num, ret = 0;
> 
> +	spin_lock(&iommu->lock);
>  	domain->iommu_refcnt[iommu->seq_id] += 1;
>  	if (domain->iommu_refcnt[iommu->seq_id] == 1) {
>  		ndomains = cap_ndoms(iommu->cap);
> @@ -1792,7 +1789,8 @@ static int domain_attach_iommu(struct
> dmar_domain *domain,
>  		if (num >= ndomains) {
>  			pr_err("%s: No free domain ids\n", iommu->name);
>  			domain->iommu_refcnt[iommu->seq_id] -= 1;
> -			return -ENOSPC;
> +			ret = -ENOSPC;
> +			goto out_unlock;
>  		}
> 
>  		set_bit(num, iommu->domain_ids);
> @@ -1801,7 +1799,9 @@ static int domain_attach_iommu(struct
> dmar_domain *domain,
>  		domain_update_iommu_cap(domain);
>  	}
> 
> -	return 0;
> +out_unlock:
> +	spin_unlock(&iommu->lock);
> +	return ret;
>  }
> 
>  static void domain_detach_iommu(struct dmar_domain *domain,
> @@ -1809,9 +1809,7 @@ static void domain_detach_iommu(struct
> dmar_domain *domain,
>  {
>  	int num;
> 
> -	assert_spin_locked(&device_domain_lock);
> -	assert_spin_locked(&iommu->lock);
> -
> +	spin_lock(&iommu->lock);
>  	domain->iommu_refcnt[iommu->seq_id] -= 1;
>  	if (domain->iommu_refcnt[iommu->seq_id] == 0) {
>  		num = domain->iommu_did[iommu->seq_id];
> @@ -1819,6 +1817,7 @@ static void domain_detach_iommu(struct
> dmar_domain *domain,
>  		domain_update_iommu_cap(domain);
>  		domain->iommu_did[iommu->seq_id] = 0;
>  	}
> +	spin_unlock(&iommu->lock);
>  }
> 
>  static inline int guestwidth_to_adjustwidth(int gaw)
> @@ -2471,9 +2470,7 @@ static int domain_add_dev_info(struct
> dmar_domain *domain, struct device *dev)
> 
>  	spin_lock_irqsave(&device_domain_lock, flags);
>  	info->domain = domain;
> -	spin_lock(&iommu->lock);
>  	ret = domain_attach_iommu(domain, iommu);
> -	spin_unlock(&iommu->lock);
>  	if (ret) {
>  		spin_unlock_irqrestore(&device_domain_lock, flags);
>  		return ret;
> @@ -4158,7 +4155,6 @@ static void __dmar_remove_one_dev_info(struct
> device_domain_info *info)
>  {
>  	struct dmar_domain *domain;
>  	struct intel_iommu *iommu;
> -	unsigned long flags;
> 
>  	assert_spin_locked(&device_domain_lock);
> 
> @@ -4179,10 +4175,7 @@ static void __dmar_remove_one_dev_info(struct
> device_domain_info *info)
>  	}
> 
>  	list_del(&info->link);
> -
> -	spin_lock_irqsave(&iommu->lock, flags);
>  	domain_detach_iommu(domain, iommu);
> -	spin_unlock_irqrestore(&iommu->lock, flags);
>  }
> 
>  static void dmar_remove_one_dev_info(struct device *dev)
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers
  2022-05-27  6:30 ` [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers Lu Baolu
@ 2022-06-01  9:18   ` Tian, Kevin
  2022-06-01 10:48     ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  9:18 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> The iommu->lock is used to protect the per-IOMMU pasid directory table
> and pasid table. Move the spinlock acquisition/release into the helpers
> to make the code self-contained.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with one nit

> 
> -	/* Caller must ensure PASID entry is not in use. */
> -	if (pasid_pte_is_present(pte))
> -		return -EBUSY;
> +	spin_lock(&iommu->lock);
> +	pte = get_non_present_pasid_entry(dev, pasid);
> +	if (!pte) {
> +		spin_unlock(&iommu->lock);
> +		return -ENODEV;
> +	}

I don't think the above is a good abstraction, and it changes the error
code for a present entry from -EBUSY to -ENODEV.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-05-27  6:30 ` [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path Lu Baolu
  2022-05-27 15:05   ` Jason Gunthorpe
@ 2022-06-01  9:28   ` Tian, Kevin
  2022-06-01 11:02     ` Baolu Lu
  1 sibling, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2022-06-01  9:28 UTC (permalink / raw)
  To: Lu Baolu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, May 27, 2022 2:30 PM
> 
> When the IOMMU domain is about to be freed, it should not be set on any
> device. Instead of silently dealing with some bug cases, it's better to
> trigger a warning to report and fix any potential bugs at the first time.
> 


>  static void domain_exit(struct dmar_domain *domain)
>  {
> -
> -	/* Remove associated devices and clear attached or cached domains
> */
> -	domain_remove_dev_info(domain);
> +	if (WARN_ON(!list_empty(&domain->devices)))
> +		return;
> 

A warning is good, but it doesn't mean the driver shouldn't deal with
that situation to make things safer, e.g. by blocking DMA from all
attached devices...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers
  2022-06-01  9:09   ` Tian, Kevin
@ 2022-06-01 10:38     ` Baolu Lu
  0 siblings, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-01 10:38 UTC (permalink / raw)
  To: Tian, Kevin, Joerg Roedel, Raj, Ashok, Christoph Hellwig,
	Jason Gunthorpe
  Cc: baolu.lu, Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun,
	iommu, linux-kernel

Hi Kevin,

Thank you for the comments.

On 2022/6/1 17:09, Tian, Kevin wrote:
>> From: Lu Baolu<baolu.lu@linux.intel.com>
>> Sent: Friday, May 27, 2022 2:30 PM
>>
>> The iommu->lock is used to protect the per-IOMMU domain ID resource.
>> Move the spinlock acquisition/release into the helpers where domain
>> IDs are allocated and freed. The device_domain_lock is irrelevant to
>> domain ID resources, remove its assertion as well.
> while moving the lock you also replace spin_lock_irqsave() with spin_lock().
> It'd be cleaner to just do movement here and then replace all _irqsave()
> in patch 8.

Yeah, that will be clearer.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers
  2022-06-01  9:18   ` Tian, Kevin
@ 2022-06-01 10:48     ` Baolu Lu
  0 siblings, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-01 10:48 UTC (permalink / raw)
  To: Tian, Kevin, Joerg Roedel, Raj, Ashok, Christoph Hellwig,
	Jason Gunthorpe
  Cc: baolu.lu, Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun,
	iommu, linux-kernel

On 2022/6/1 17:18, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Friday, May 27, 2022 2:30 PM
>>
>> The iommu->lock is used to protect the per-IOMMU pasid directory table
>> and pasid table. Move the spinlock acquisition/release into the helpers
>> to make the code self-contained.
>>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with one nit
> 
>>
>> -	/* Caller must ensure PASID entry is not in use. */
>> -	if (pasid_pte_is_present(pte))
>> -		return -EBUSY;
>> +	spin_lock(&iommu->lock);
>> +	pte = get_non_present_pasid_entry(dev, pasid);
>> +	if (!pte) {
>> +		spin_unlock(&iommu->lock);
>> +		return -ENODEV;
>> +	}
> 
> I don't think above is a good abstraction and it changes the error
> code for an present entry from -EBUSY to -ENODEV.

Sure. I will roll it back to -EBUSY. I added this helper because the
same code appears at least three times in this file.

Best regards,
baolu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-06-01  9:28   ` Tian, Kevin
@ 2022-06-01 11:02     ` Baolu Lu
  2022-06-02  6:29       ` Tian, Kevin
  0 siblings, 1 reply; 55+ messages in thread
From: Baolu Lu @ 2022-06-01 11:02 UTC (permalink / raw)
  To: Tian, Kevin, Joerg Roedel, Raj, Ashok, Christoph Hellwig,
	Jason Gunthorpe
  Cc: baolu.lu, Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun,
	iommu, linux-kernel

On 2022/6/1 17:28, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Friday, May 27, 2022 2:30 PM
>>
>> When the IOMMU domain is about to be freed, it should not be set on any
>> device. Instead of silently dealing with some bug cases, it's better to
>> trigger a warning to report and fix any potential bugs at the first time.
>>
> 
> 
>>   static void domain_exit(struct dmar_domain *domain)
>>   {
>> -
>> -	/* Remove associated devices and clear attached or cached domains
>> */
>> -	domain_remove_dev_info(domain);
>> +	if (WARN_ON(!list_empty(&domain->devices)))
>> +		return;
>>
> 
> A warning is good, but it doesn't mean the driver shouldn't deal with
> that situation to make things safer, e.g. by blocking DMA from all
> attached devices...

I have thought the same thing. :-)

Blocking DMA from an attached device should be done when setting the
blocking domain on the device. It should not be part of freeing a
domain.

Here, the caller asks the driver to free the domain, but the driver
finds that something is wrong. Therefore, it warns and returns directly.
The domain will still be there in use until the next set_domain().

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-05-31 23:10                               ` Jason Gunthorpe
  2022-06-01  8:53                                 ` Tian, Kevin
@ 2022-06-01 12:18                                 ` Joao Martins
  2022-06-01 12:33                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-06-01 12:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy
  Cc: Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj, Christoph Hellwig,
	Will Deacon, Liu Yi L, Jacob jun Pan, iommu, linux-kernel

On 6/1/22 00:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote:
>> There are only 3 instances where we'll free a table while the domain is
>> live. The first is the one legitimate race condition, where two map requests
>> targeting relatively nearby PTEs both go to fill in an intermediate level of
>> table; whoever loses that race frees the table they allocated, but it was
>> never visible to anyone else so that's definitely fine. The second is if
>> we're mapping a block entry, and find that there's already a table entry
>> there, wherein we assume the table must be empty, clear the entry,
>> invalidate any walk caches, install the block entry, then free the orphaned
>> table; since we're mapping the entire IOVA range covered by that table,
>> there should be no other operations on that IOVA range attempting to walk
>> the table at the same time, so it's fine. 
> 
> I saw these two in the Intel driver
> 
>> The third is effectively the inverse, if we get a block-sized unmap
>> but find a table entry rather than a block at that point (on the
>> assumption that it's de-facto allowed for a single unmap to cover
>> multiple adjacent mappings as long as it does so exactly); similarly
>> we assume that the table must be full, and no other operations
>> should be racing because we're unmapping its whole IOVA range, so we
>> remove the table entry, invalidate, and free as before.
> 
> Not sure I noticed this one though
> 
> This all makes sense though.
> 

I saw all three incantations in AMD iommu /I think/

AMD has a sort of replicated-PTE concept for representing a power-of-2
page size mapped in 'default' page sizes (4K, 2M, 1G), and we need to
check every single replica. While writing the RFC I realised that's not
really the most efficient thing, considering the IOMMU hardware doesn't
guarantee the dirty bit is set on every replicated PTE. And maybe it's
obscuring the fact that we don't attempt to pick the best page size for
the IOVA mapping (like Intel does), instead allowing all power-of-2
page sizes and letting the IOMMU hardware cope as best it can.

>> Although we don't have debug dumping for io-pgtable-arm, it's good to be
>> thinking about this, since it's made me realise that dirty-tracking sweeps
>> per that proposal might pose a similar kind of concern, so we might still
>> need to harden these corners for the sake of that.
> 
> Lets make sure Joao sees this..
> 
> It is interesting because we probably don't want the big latency
> spikes that would come from using locking to block map/unmap while
> dirty reading is happening - e.g. at the iommufd level.
> 
Especially when we might be scanning big IOVA ranges.

> From a consistency POV the only case that matters is unmap and unmap
> should already be doing a dedicated dirty read directly prior to the
> unmap (as per that other long thread)
> 
/me nods, yes

FWIW, I've already removed the unmap_read_dirty ops.

> So having safe racy reading in the kernel is probably best, and so RCU
> would be good here too.
> 

Reading dirties ought to be similar to map/unmap but slightly simpler as
I supposedly don't need to care about the pte changing under the hood (or
so I initially thought). I was wrestling at some point if test-and-clear
was enough or whether I switch back cmpxchg to detect the pte has changed
and only mark dirty based on the old value[*]. The latter would align with
how map/unmap performs the pte updates.

[*] e.g. potentially the new entry hasn't been 'seen' by the IOMMU
walker, or it might be a smaller size than what got dirtied via an
iopte split, or it might be racing with a new map

>> that somewhere I have some half-finished patches making io-pgtable-arm use
>> the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once
>> (perhaps we might even be able to RCU-ify the freelist generically? I'll see
>> how it goes when I get there).
> 
> This is all very similar to how the mm's tlb gather stuff works,
> especially on PPC which requires RCU protection of page tables instead
> of the IPI trick.
> 
> Currently the rcu_head and the lru overlap in the struct page. To do
> this I'd suggest following what was done for slab - see ca1a46d6f506
> ("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class
> like 'struct slab' for use with iommu page tables and put the
> list_head and rcu_head sequentially in the struct iommu_pt_page.
> 
> Continue to use put_pages_list() just with the list_head in the new
> struct not the lru.
> 
> If we need it for dirty tracking then it makes sense to start
> progressing parts at least..
> 
The suggestions here seem to apply to map/unmap too, not just
read_dirty() alone, IIUC.

I am not sure yet whether dynamic demote/promote of page sizes changes this.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-06-01 12:18                                 ` Joao Martins
@ 2022-06-01 12:33                                   ` Jason Gunthorpe
  2022-06-01 13:52                                     ` Joao Martins
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2022-06-01 12:33 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj,
	Christoph Hellwig, Will Deacon, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:

> > So having safe racy reading in the kernel is probably best, and so RCU
> > would be good here too.
> 
> Reading dirties ought to be similar to map/unmap but slightly simpler as
> I supposedly don't need to care about the pte changing under the hood (or
> so I initially thought). I was wrestling at some point if test-and-clear
> was enough or whether I switch back cmpxchg to detect the pte has changed
> and only mark dirty based on the old value[*]. The latter would align with
> how map/unmap performs the pte updates.

test-and-clear should be fine, but this all needs to be done under an
RCU context while the page tables themselves are freed by RCU. Then
you can safely chase the page table pointers down to each level
without fear of UAF.
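
Roughly (sketch; the pte accessors and dirty-bit name are illustrative,
not the actual Intel/AMD definitions):

	rcu_read_lock();
	pte = READ_ONCE(*ptep);
	if (pte_present(pte) &&
	    test_and_clear_bit(DIRTY_BIT_NR, (unsigned long *)ptep))
		mark_dirty(bitmap, iova);	/* caller-provided bitmap */
	rcu_read_unlock();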

> I am not sure yet whether dynamic demote/promote of page sizes changes this.

For this kind of primitive the caller must provide the locking, just
like map/unmap.

Effectively you can consider the iommu_domain as having externally
provided range-locks over the IOVA space. map/unmap/demote/promote
must run serially over intersecting IOVA ranges.

In terms of iommufd this means we always have to hold a lock related
to the area (which is the IOVA range) before issuing any iommu call on
the domain.
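
In code terms, roughly (conceptual sketch, not iommufd's actual
locking):

	mutex_lock(&area->lock);	/* serialises this IOVA range */
	ret = iommu_map(domain, area->iova, phys, size, prot);
	mutex_unlock(&area->lock);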

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-06-01 12:33                                   ` Jason Gunthorpe
@ 2022-06-01 13:52                                     ` Joao Martins
  2022-06-01 14:22                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Joao Martins @ 2022-06-01 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj,
	Christoph Hellwig, Will Deacon, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On 6/1/22 13:33, Jason Gunthorpe wrote:
> On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:
> 
>>> So having safe racy reading in the kernel is probably best, and so RCU
>>> would be good here too.
>>
>> Reading dirties ought to be similar to map/unmap but slightly simpler as
>> I supposedly don't need to care about the pte changing under the hood (or
>> so I initially thought). I was wrestling at some point if test-and-clear
>> was enough or whether I switch back cmpxchg to detect the pte has changed
>> and only mark dirty based on the old value[*]. The latter would align with
>> how map/unmap performs the pte updates.
> 
> test-and-clear should be fine, but this all needs to be done under a
> RCU context while the page tables themsevles are freed by RCU. Then
> you can safely chase the page table pointers down to each level
> without fear of UAF.
> 

I was actually thinking more of holding the same IOVA range lock, to
align with the rest of map/unmap/demote/etc.? All these IO page table
manipulations have the same performance requirements.

>> I am not sure yet whether dynamic demote/promote of page sizes changes this.
> 
> For this kind of primitive the caller must provide the locking, just
> like map/unmap.
> 
Ah OK.

> Effectively you can consider the iommu_domain as having externally
> provided range-locks over the IOVA space. map/unmap/demote/promote
> must run serially over intersecting IOVA ranges.
> 
> In terms of iommufd this means we always have to hold a lock related
> to the area (which is the IOVA range) before issuing any iommu call on
> the domain.

/me nods

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
  2022-06-01 13:52                                     ` Joao Martins
@ 2022-06-01 14:22                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2022-06-01 14:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Robin Murphy, Baolu Lu, Joerg Roedel, Kevin Tian, Ashok Raj,
	Christoph Hellwig, Will Deacon, Liu Yi L, Jacob jun Pan, iommu,
	linux-kernel

On Wed, Jun 01, 2022 at 02:52:05PM +0100, Joao Martins wrote:
> On 6/1/22 13:33, Jason Gunthorpe wrote:
> > On Wed, Jun 01, 2022 at 01:18:52PM +0100, Joao Martins wrote:
> > 
> >>> So having safe racy reading in the kernel is probably best, and so RCU
> >>> would be good here too.
> >>
> >> Reading dirties ought to be similar to map/unmap but slightly simpler, as
> >> I supposedly don't need to care about the pte changing under the hood (or
> >> so I initially thought). I was wrestling at some point with whether
> >> test-and-clear was enough, or whether to switch back to cmpxchg to detect
> >> that the pte has changed and only mark dirty based on the old value[*].
> >> The latter would align with how map/unmap perform the pte updates.
> > 
> > test-and-clear should be fine, but this all needs to be done under a
> > RCU context while the page tables themselves are freed by RCU. Then
> > you can safely chase the page table pointers down to each level
> > without fear of UAF.
> > 
> 
> I was actually thinking more towards holding the same IOVA range lock, to
> align with the rest of map/unmap/demote/etc. All these IO page table
> manipulations have the same performance requirements.

IMHO ideally we would not make read dirty use the IOVA range lock, because
we want to read dirty big swaths of IOVA at a time and it shouldn't be
forced to chunk based on the arbitrary area construction.

But yes this could also be an option.
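
Roughly this shape, where every name is a placeholder (fetch_pte() stands
in for the page table walker, and the test-and-clear helper is the one
sketched earlier in the thread); the point is only that the walk runs
under RCU rather than a range lock:

static void read_and_clear_dirty(struct dmar_domain *domain,
				 unsigned long iova, size_t size,
				 unsigned long *bitmap)
{
	unsigned long addr;

	/* page-table pages are freed via RCU, so a reader can walk an
	 * arbitrarily big swath of IOVA without fear of UAF */
	rcu_read_lock();
	for (addr = iova; addr < iova + size; addr += PAGE_SIZE) {
		struct dma_pte *pte = fetch_pte(domain, addr);

		if (pte && dma_pte_test_and_clear_dirty(pte))
			set_bit((addr - iova) >> PAGE_SHIFT, bitmap);
	}
	rcu_read_unlock();
}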

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-06-01 11:02     ` Baolu Lu
@ 2022-06-02  6:29       ` Tian, Kevin
  2022-06-06  1:34         ` Baolu Lu
  0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2022-06-02  6:29 UTC (permalink / raw)
  To: Baolu Lu, Joerg Roedel, Raj, Ashok, Christoph Hellwig, Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun, iommu,
	linux-kernel

> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, June 1, 2022 7:02 PM
> 
> On 2022/6/1 17:28, Tian, Kevin wrote:
> >> From: Lu Baolu <baolu.lu@linux.intel.com>
> >> Sent: Friday, May 27, 2022 2:30 PM
> >>
> >> When the IOMMU domain is about to be freed, it should not be set on
> >> any device. Instead of silently dealing with some bug cases, it's
> >> better to trigger a warning to report and fix any potential bugs as
> >> early as possible.
> >>
> >
> >
> >>   static void domain_exit(struct dmar_domain *domain)
> >>   {
> >> -
> >> -	/* Remove associated devices and clear attached or cached domains */
> >> -	domain_remove_dev_info(domain);
> >> +	if (WARN_ON(!list_empty(&domain->devices)))
> >> +		return;
> >>
> >
> > warning is good, but it doesn't mean the driver shouldn't deal with
> > that situation to make it safer, e.g. by blocking DMA from all attached
> > devices...
> 
> I have thought the same thing. :-)
> 
> Blocking DMA from an attached device should be done when setting the
> blocking domain on the device. It should not be part of freeing a domain.

Yes, but here we are talking about a bug scenario.

> 
> Here, the caller asks the driver to free the domain, but the driver
> finds that something is wrong. Therefore, it warns and returns directly.
> The domain will remain in use until the next set_domain().
> 

at least it'd look safer if we always try to unmap the entire domain, i.e.:

static void domain_exit(struct dmar_domain *domain)
{
-
-	/* Remove associated devices and clear attached or cached domains */
-	domain_remove_dev_info(domain);

	if (domain->pgd) {
		LIST_HEAD(freelist);

		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
		put_pages_list(&freelist);
	}

+	if (WARN_ON(!list_empty(&domain->devices)))
+		return;

	kfree(domain);
}
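
(Note the ordering: the unmap still runs in the bug case, and the early
return then deliberately leaks the domain rather than kfree()ing memory
that attached devices may still reference.)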

Thanks
Kevin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path
  2022-06-02  6:29       ` Tian, Kevin
@ 2022-06-06  1:34         ` Baolu Lu
  0 siblings, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2022-06-06  1:34 UTC (permalink / raw)
  To: Tian, Kevin, Joerg Roedel, Raj, Ashok, Christoph Hellwig,
	Jason Gunthorpe
  Cc: baolu.lu, Will Deacon, Robin Murphy, Liu, Yi L, Pan, Jacob jun,
	iommu, linux-kernel

On 2022/6/2 14:29, Tian, Kevin wrote:
>> From: Baolu Lu <baolu.lu@linux.intel.com>
>> Sent: Wednesday, June 1, 2022 7:02 PM
>>
>> On 2022/6/1 17:28, Tian, Kevin wrote:
>>>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>>> Sent: Friday, May 27, 2022 2:30 PM
>>>>
>>>> When the IOMMU domain is about to be freed, it should not be set on
>>>> any device. Instead of silently dealing with some bug cases, it's
>>>> better to trigger a warning to report and fix any potential bugs as
>>>> early as possible.
>>>>
>>>
>>>
>>>>    static void domain_exit(struct dmar_domain *domain)
>>>>    {
>>>> -
>>>> -	/* Remove associated devices and clear attached or cached domains */
>>>> -	domain_remove_dev_info(domain);
>>>> +	if (WARN_ON(!list_empty(&domain->devices)))
>>>> +		return;
>>>>
>>>
>>> warning is good, but it doesn't mean the driver shouldn't deal with
>>> that situation to make it safer, e.g. by blocking DMA from all attached
>>> devices...
>>
>> I have thought the same thing. :-)
>>
>> Blocking DMA from an attached device should be done when setting the
>> blocking domain on the device. It should not be part of freeing a domain.
> 
> Yes, but here we are talking about a bug scenario.
> 
>>
>> Here, the caller asks the driver to free the domain, but the driver
>> finds that something is wrong. Therefore, it warns and returns directly.
>> The domain will remain in use until the next set_domain().
>>
> 
> at least it'd look safer if we always try to unmap the entire domain, i.e.:
> 
> static void domain_exit(struct dmar_domain *domain)
> {
> -
> -	/* Remove associated devices and clear attached or cached domains */
> -	domain_remove_dev_info(domain);
> 
> 	if (domain->pgd) {
> 		LIST_HEAD(freelist);
> 
> 		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
> 		put_pages_list(&freelist);
> 	}
> 
> +	if (WARN_ON(!list_empty(&domain->devices)))
> +		return;
> 
> 	kfree(domain);
> }

Fair enough. Removing all mappings is safer.
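
Something along these lines, then, with your diff cleaned up (a sketch
only; the final patch may differ):

static void domain_exit(struct dmar_domain *domain)
{
	if (domain->pgd) {
		LIST_HEAD(freelist);

		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
		put_pages_list(&freelist);
	}

	/* leak rather than free a domain that devices still reference */
	if (WARN_ON(!list_empty(&domain->devices)))
		return;

	kfree(domain);
}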

Best regards,
baolu

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2022-06-06  1:35 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-27  6:30 [PATCH 00/12] iommu/vt-d: Optimize the use of locks Lu Baolu
2022-05-27  6:30 ` [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs Lu Baolu
2022-05-27 14:59   ` Jason Gunthorpe
2022-05-29  5:14     ` Baolu Lu
2022-05-30 12:14       ` Jason Gunthorpe
2022-05-31  3:02         ` Baolu Lu
2022-05-31 13:10           ` Jason Gunthorpe
2022-05-31 14:11             ` Baolu Lu
2022-05-31 14:53               ` Jason Gunthorpe
2022-05-31 15:01                 ` Robin Murphy
2022-05-31 15:13                   ` Jason Gunthorpe
2022-05-31 16:01                     ` Robin Murphy
2022-05-31 16:21                       ` Jason Gunthorpe
2022-05-31 18:07                         ` Robin Murphy
2022-05-31 18:51                           ` Jason Gunthorpe
2022-05-31 21:22                             ` Robin Murphy
2022-05-31 23:10                               ` Jason Gunthorpe
2022-06-01  8:53                                 ` Tian, Kevin
2022-06-01 12:18                                 ` Joao Martins
2022-06-01 12:33                                   ` Jason Gunthorpe
2022-06-01 13:52                                     ` Joao Martins
2022-06-01 14:22                                       ` Jason Gunthorpe
2022-06-01  6:39                             ` Baolu Lu
2022-05-31 13:52           ` Robin Murphy
2022-05-31 15:59             ` Jason Gunthorpe
2022-05-31 16:42               ` Robin Murphy
2022-06-01  5:47               ` Baolu Lu
2022-06-01  5:33             ` Baolu Lu
2022-05-27  6:30 ` [PATCH 02/12] iommu/vt-d: Remove for_each_device_domain() Lu Baolu
2022-05-27 15:00   ` Jason Gunthorpe
2022-06-01  8:53   ` Tian, Kevin
2022-05-27  6:30 ` [PATCH 03/12] iommu/vt-d: Remove clearing translation data in disable_dmar_iommu() Lu Baolu
2022-05-27 15:01   ` Jason Gunthorpe
2022-05-29  5:22     ` Baolu Lu
2022-05-27  6:30 ` [PATCH 04/12] iommu/vt-d: Use pci_get_domain_bus_and_slot() in pgtable_walk() Lu Baolu
2022-05-27 15:01   ` Jason Gunthorpe
2022-06-01  8:56   ` Tian, Kevin
2022-05-27  6:30 ` [PATCH 05/12] iommu/vt-d: Unncessary spinlock for root table alloc and free Lu Baolu
2022-06-01  9:05   ` Tian, Kevin
2022-05-27  6:30 ` [PATCH 06/12] iommu/vt-d: Acquiring lock in domain ID allocation helpers Lu Baolu
2022-06-01  9:09   ` Tian, Kevin
2022-06-01 10:38     ` Baolu Lu
2022-05-27  6:30 ` [PATCH 07/12] iommu/vt-d: Acquiring lock in pasid manipulation helpers Lu Baolu
2022-06-01  9:18   ` Tian, Kevin
2022-06-01 10:48     ` Baolu Lu
2022-05-27  6:30 ` [PATCH 08/12] iommu/vt-d: Replace spin_lock_irqsave() with spin_lock() Lu Baolu
2022-05-27  6:30 ` [PATCH 09/12] iommu/vt-d: Check device list of domain in domain free path Lu Baolu
2022-05-27 15:05   ` Jason Gunthorpe
2022-06-01  9:28   ` Tian, Kevin
2022-06-01 11:02     ` Baolu Lu
2022-06-02  6:29       ` Tian, Kevin
2022-06-06  1:34         ` Baolu Lu
2022-05-27  6:30 ` [PATCH 10/12] iommu/vt-d: Fold __dmar_remove_one_dev_info() into its caller Lu Baolu
2022-05-27  6:30 ` [PATCH 11/12] iommu/vt-d: Use device_domain_lock accurately Lu Baolu
2022-05-27  6:30 ` [PATCH 12/12] iommu/vt-d: Convert device_domain_lock into per-domain mutex Lu Baolu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).