linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] iommu/s390: Further improvements
@ 2022-10-18 14:51 Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state Niklas Schnelle
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

Hi All,

This series of patches improves the s390 IOMMU driver. These improvements help
existing IOMMU users, mainly vfio-pci, and at the same time prepare for
converting s390 to the common DMA API implementation in
drivers/iommu/dma-iommu.c, replacing the platform specific DMA API in
arch/s390/pci/pci_dma.c that sidesteps the IOMMU driver and controls the same
hardware interface directly.

Among the included changes, patch 1 improves the robustness of switching IOMMU
domains and patch 2 adds the I/O TLB operations necessary for the DMA API
conversion. Patches 3, 4, and 5 aim to improve performance, with patch 5 being
the most intrusive as it removes the I/O translation table lock in favor of
atomic updates.

This series goes on top of v7 of my previous series of IOMMU fixes[0] and
similarly is available for easy testing in the iommu_improve_v1 branch with
signed tag s390_iommu_improve_v1 of my git.kernel.org tree[1].

Best regards,
Niklas Schnelle

[0] https://lore.kernel.org/linux-iommu/20221017124558.1386337-1-schnelle@linux.ibm.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/

Niklas Schnelle (5):
  iommu/s390: Make attach succeed even if the device is in error state
  iommu/s390: Add I/O TLB ops
  iommu/s390: Use RCU to allow concurrent domain_list iteration
  iommu/s390: Optimize IOMMU table walking
  s390/pci: use lock-free I/O translation updates

 arch/s390/include/asm/pci.h |   4 +-
 arch/s390/kvm/pci.c         |   6 +-
 arch/s390/pci/pci.c         |  13 +--
 arch/s390/pci/pci_dma.c     |  77 +++++++++------
 drivers/iommu/s390-iommu.c  | 180 ++++++++++++++++++++++++------------
 5 files changed, 183 insertions(+), 97 deletions(-)

-- 
2.34.1



* [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state
  2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
@ 2022-10-18 14:51 ` Niklas Schnelle
  2022-10-28 15:55   ` Matthew Rosato
  2022-10-18 14:51 ` [PATCH 2/5] iommu/s390: Add I/O TLB ops Niklas Schnelle
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

If a zPCI device is in the error state while switching IOMMU domains,
zpci_register_ioat() will fail and we would end up with the device not
attached to any domain. In this state, since zdev->dma_table == NULL,
a reset via zpci_hot_reset_device() would wrongfully re-initialize the
device for DMA API usage using zpci_dma_init_device(). As automatic
recovery is currently disabled while attached to an IOMMU domain, this
only affects slot resets triggered through other means, but it will
affect automatic recovery once we switch to using dma-iommu.

Additionally, with that switch, common code expects attaching to the
default domain to always work, so zpci_register_ioat() should only fail
if there is no chance of recovery anyway, e.g. if the device has been
unplugged.

Improve the robustness of attach by specifically looking at the status
returned by zpci_mod_fc() to determine if the device is unavailable and,
in that case, simply ignoring the error. Once the device is reset,
zpci_hot_reset_device() will then correctly set the domain's DMA
translation tables.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h |  2 +-
 arch/s390/kvm/pci.c         |  6 ++++--
 arch/s390/pci/pci.c         | 11 ++++++-----
 arch/s390/pci/pci_dma.c     |  3 ++-
 drivers/iommu/s390-iommu.c  |  9 +++++++--
 5 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 15f8714ca9b7..07361e2fd8c5 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -221,7 +221,7 @@ void zpci_device_reserved(struct zpci_dev *zdev);
 bool zpci_is_device_configured(struct zpci_dev *zdev);
 
 int zpci_hot_reset_device(struct zpci_dev *zdev);
-int zpci_register_ioat(struct zpci_dev *, u8, u64, u64, u64);
+int zpci_register_ioat(struct zpci_dev *, u8, u64, u64, u64, u8 *);
 int zpci_unregister_ioat(struct zpci_dev *, u8);
 void zpci_remove_reserved_devices(void);
 void zpci_update_fh(struct zpci_dev *zdev, u32 fh);
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index c50c1645c0ae..03964c0e1fdf 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -434,6 +434,7 @@ static void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
 static int kvm_s390_pci_register_kvm(void *opaque, struct kvm *kvm)
 {
 	struct zpci_dev *zdev = opaque;
+	u8 status;
 	int rc;
 
 	if (!zdev)
@@ -486,7 +487,7 @@ static int kvm_s390_pci_register_kvm(void *opaque, struct kvm *kvm)
 
 	/* Re-register the IOMMU that was already created */
 	rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-				virt_to_phys(zdev->dma_table));
+				virt_to_phys(zdev->dma_table), &status);
 	if (rc)
 		goto clear_gisa;
 
@@ -516,6 +517,7 @@ static void kvm_s390_pci_unregister_kvm(void *opaque)
 {
 	struct zpci_dev *zdev = opaque;
 	struct kvm *kvm;
+	u8 status;
 
 	if (!zdev)
 		return;
@@ -554,7 +556,7 @@ static void kvm_s390_pci_unregister_kvm(void *opaque)
 
 	/* Re-register the IOMMU that was already created */
 	zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-			   virt_to_phys(zdev->dma_table));
+			   virt_to_phys(zdev->dma_table), &status);
 
 out:
 	spin_lock(&kvm->arch.kzdev_list_lock);
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 73cdc5539384..a703dcd94a68 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -116,20 +116,20 @@ EXPORT_SYMBOL_GPL(pci_proc_domain);
 
 /* Modify PCI: Register I/O address translation parameters */
 int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
-		       u64 base, u64 limit, u64 iota)
+		       u64 base, u64 limit, u64 iota, u8 *status)
 {
 	u64 req = ZPCI_CREATE_REQ(zdev->fh, dmaas, ZPCI_MOD_FC_REG_IOAT);
 	struct zpci_fib fib = {0};
-	u8 cc, status;
+	u8 cc;
 
 	WARN_ON_ONCE(iota & 0x3fff);
 	fib.pba = base;
 	fib.pal = limit;
 	fib.iota = iota | ZPCI_IOTA_RTTO_FLAG;
 	fib.gd = zdev->gisa;
-	cc = zpci_mod_fc(req, &fib, &status);
+	cc = zpci_mod_fc(req, &fib, status);
 	if (cc)
-		zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
+		zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, *status);
 	return cc;
 }
 EXPORT_SYMBOL_GPL(zpci_register_ioat);
@@ -764,6 +764,7 @@ EXPORT_SYMBOL_GPL(zpci_disable_device);
  */
 int zpci_hot_reset_device(struct zpci_dev *zdev)
 {
+	u8 status;
 	int rc;
 
 	zpci_dbg(3, "rst fid:%x, fh:%x\n", zdev->fid, zdev->fh);
@@ -787,7 +788,7 @@ int zpci_hot_reset_device(struct zpci_dev *zdev)
 
 	if (zdev->dma_table)
 		rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-					virt_to_phys(zdev->dma_table));
+					virt_to_phys(zdev->dma_table), &status);
 	else
 		rc = zpci_dma_init_device(zdev);
 	if (rc) {
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 227cf0a62800..dee825ee7305 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -547,6 +547,7 @@ static void s390_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	
 int zpci_dma_init_device(struct zpci_dev *zdev)
 {
+	u8 status;
 	int rc;
 
 	/*
@@ -598,7 +599,7 @@ int zpci_dma_init_device(struct zpci_dev *zdev)
 
 	}
 	if (zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-			       virt_to_phys(zdev->dma_table))) {
+			       virt_to_phys(zdev->dma_table), &status)) {
 		rc = -EIO;
 		goto free_bitmap;
 	}
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 6c407b61b25a..ee88e717254b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -98,6 +98,7 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev = to_zpci_dev(dev);
 	unsigned long flags;
+	u8 status;
 	int cc;
 
 	if (!zdev)
@@ -113,8 +114,12 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 		zpci_dma_exit_device(zdev);
 
 	cc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-				virt_to_phys(s390_domain->dma_table));
-	if (cc)
+				virt_to_phys(s390_domain->dma_table), &status);
+	/*
+	 * If the device is undergoing error recovery the reset code
+	 * will re-establish the new domain.
+	 */
+	if (cc && status != ZPCI_PCI_ST_FUNC_NOT_AVAIL)
 		return -EIO;
 	zdev->dma_table = s390_domain->dma_table;
 
-- 
2.34.1



* [PATCH 2/5] iommu/s390: Add I/O TLB ops
  2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state Niklas Schnelle
@ 2022-10-18 14:51 ` Niklas Schnelle
  2022-10-28 16:03   ` Matthew Rosato
  2022-10-18 14:51 ` [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration Niklas Schnelle
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

Currently s390-iommu does an explicit I/O TLB flush (RPCIT) for every
update of the I/O translation table. This is wasteful in two ways: the
RPCIT can be skipped after a mapping operation if zdev->tlb_refresh is
unset, and a single RPCIT can cover a whole range of pages, for example
when doing lazy unmapping.

Thankfully both of these optimizations can be achieved by implementing
the IOMMU operations that common code provides for the different types
of I/O TLB flushes (a rough sketch of the resulting flow follows the
list):

 * flush_iotlb_all: Flushes the I/O TLB for the entire IOVA space
 * iotlb_sync:  Flushes the I/O TLB for a range of pages that can be
   gathered up, for example to implement lazy unmapping.
 * iotlb_sync_map: Flushes the I/O TLB after a mapping operation
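
Roughly, common code drives these hooks like this (simplified sketch,
not part of this patch; domain, iova and size are assumed to come from
the unmap path):

	struct iommu_iotlb_gather gather;

	iommu_iotlb_gather_init(&gather);	/* gather.start = ULONG_MAX */
	/* the driver's .unmap_pages hook records each unmapped range: */
	iommu_iotlb_gather_add_range(&gather, iova, size);
	/* one ranged flush for everything gathered so far: */
	iommu_iotlb_sync(domain, &gather);	/* invokes .iotlb_sync */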

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/s390-iommu.c | 76 ++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index ee88e717254b..a4c2e9bc6d83 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -199,14 +199,72 @@ static void s390_iommu_release_device(struct device *dev)
 		__s390_iommu_detach_device(zdev);
 }
 
+static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
+{
+	struct s390_domain *s390_domain = to_s390_domain(domain);
+	struct zpci_dev *zdev;
+	unsigned long flags;
+	int rc;
+
+	spin_lock_irqsave(&s390_domain->list_lock, flags);
+	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
+					zdev->end_dma - zdev->start_dma + 1);
+		if (rc)
+			break;
+	}
+	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+}
+
+static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
+				  struct iommu_iotlb_gather *gather)
+{
+	struct s390_domain *s390_domain = to_s390_domain(domain);
+	size_t size = gather->end - gather->start + 1;
+	struct zpci_dev *zdev;
+	unsigned long flags;
+	int rc;
+
+	/* If nothing was added to the gather there is nothing to flush */
+	if (gather->start == ULONG_MAX)
+		return;
+
+	spin_lock_irqsave(&s390_domain->list_lock, flags);
+	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
+					size);
+		if (rc)
+			break;
+	}
+	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+}
+
+static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
+				      unsigned long iova, size_t size)
+{
+	struct s390_domain *s390_domain = to_s390_domain(domain);
+	struct zpci_dev *zdev;
+	unsigned long flags;
+	int rc;
+
+	spin_lock_irqsave(&s390_domain->list_lock, flags);
+	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+		if (!zdev->tlb_refresh)
+			continue;
+		rc = zpci_refresh_trans((u64)zdev->fh << 32,
+					iova, size);
+		if (rc)
+			break;
+	}
+	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+}
+
 static int s390_iommu_update_trans(struct s390_domain *s390_domain,
 				   phys_addr_t pa, dma_addr_t dma_addr,
 				   unsigned long nr_pages, int flags)
 {
 	phys_addr_t page_addr = pa & PAGE_MASK;
-	dma_addr_t start_dma_addr = dma_addr;
 	unsigned long irq_flags, i;
-	struct zpci_dev *zdev;
 	unsigned long *entry;
 	int rc = 0;
 
@@ -225,15 +283,6 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
 		dma_addr += PAGE_SIZE;
 	}
 
-	spin_lock(&s390_domain->list_lock);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
-		rc = zpci_refresh_trans((u64)zdev->fh << 32,
-					start_dma_addr, nr_pages * PAGE_SIZE);
-		if (rc)
-			break;
-	}
-	spin_unlock(&s390_domain->list_lock);
-
 undo_cpu_trans:
 	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
 		flags = ZPCI_PTE_INVALID;
@@ -340,6 +389,8 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
 	if (rc)
 		return 0;
 
+	iommu_iotlb_gather_add_range(gather, iova, size);
+
 	return size;
 }
 
@@ -384,6 +435,9 @@ static const struct iommu_ops s390_iommu_ops = {
 		.detach_dev	= s390_iommu_detach_device,
 		.map_pages	= s390_iommu_map_pages,
 		.unmap_pages	= s390_iommu_unmap_pages,
+		.flush_iotlb_all = s390_iommu_flush_iotlb_all,
+		.iotlb_sync      = s390_iommu_iotlb_sync,
+		.iotlb_sync_map  = s390_iommu_iotlb_sync_map,
 		.iova_to_phys	= s390_iommu_iova_to_phys,
 		.free		= s390_domain_free,
 	}
-- 
2.34.1



* [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 2/5] iommu/s390: Add I/O TLB ops Niklas Schnelle
@ 2022-10-18 14:51 ` Niklas Schnelle
  2022-10-18 15:18   ` Jason Gunthorpe
  2022-10-18 14:51 ` [PATCH 4/5] iommu/s390: Optimize IOMMU table walking Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 5/5] s390/pci: use lock-free I/O translation updates Niklas Schnelle
  4 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

The s390_domain->devices list is only added to when new devices are
attached, but it is iterated through in read-only fashion for every
mapping operation as well as for I/O TLB flushes, and thus in
performance critical code, causing contention on the
s390_domain->list_lock. Fortunately such a read-mostly linked list is
a standard use case for RCU. This change closely follows the example
for an RCU protected list given in Documentation/RCU/listRCU.rst.
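
For reference, the core of that pattern as used here looks roughly
like this (simplified sketch, not part of the diff below):

	/* writers still serialize against each other via list_lock */
	spin_lock_irqsave(&s390_domain->list_lock, flags);
	list_add_rcu(&zdev->iommu_list, &s390_domain->devices);
	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

	/* readers run concurrently, protected only by RCU */
	rcu_read_lock();
	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
		/* read-only access to zdev */
	}
	rcu_read_unlock();

with the freeing of a zpci_dev deferred past all readers via
kfree_rcu(zdev, rcu).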

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h |  1 +
 arch/s390/pci/pci.c         |  2 +-
 drivers/iommu/s390-iommu.c  | 31 ++++++++++++++++---------------
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 07361e2fd8c5..e4c3e4e04d30 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -119,6 +119,7 @@ struct zpci_dev {
 	struct list_head entry;		/* list of all zpci_devices, needed for hotplug, etc. */
 	struct list_head iommu_list;
 	struct kref kref;
+	struct rcu_head rcu;
 	struct hotplug_slot hotplug_slot;
 
 	enum zpci_state state;
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index a703dcd94a68..ef38b1514c77 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -996,7 +996,7 @@ void zpci_release_device(struct kref *kref)
 		break;
 	}
 	zpci_dbg(3, "rem fid:%x\n", zdev->fid);
-	kfree(zdev);
+	kfree_rcu(zdev, rcu);
 }
 
 int zpci_report_error(struct pci_dev *pdev,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index a4c2e9bc6d83..4e90987be387 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -10,6 +10,8 @@
 #include <linux/iommu.h>
 #include <linux/iommu-helper.h>
 #include <linux/sizes.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
 #include <asm/pci_dma.h>
 
 static const struct iommu_ops s390_iommu_ops;
@@ -61,7 +63,7 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 
 	spin_lock_init(&s390_domain->dma_table_lock);
 	spin_lock_init(&s390_domain->list_lock);
-	INIT_LIST_HEAD(&s390_domain->devices);
+	INIT_LIST_HEAD_RCU(&s390_domain->devices);
 
 	return &s390_domain->domain;
 }
@@ -70,7 +72,9 @@ static void s390_domain_free(struct iommu_domain *domain)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 
+	rcu_read_lock();
 	WARN_ON(!list_empty(&s390_domain->devices));
+	rcu_read_unlock();
 	dma_cleanup_tables(s390_domain->dma_table);
 	kfree(s390_domain);
 }
@@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
 		return;
 
 	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_del_init(&zdev->iommu_list);
+	list_del_rcu(&zdev->iommu_list);
 	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
 
 	zpci_unregister_ioat(zdev, 0);
@@ -127,7 +131,7 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 	zdev->s390_domain = s390_domain;
 
 	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_add(&zdev->iommu_list, &s390_domain->devices);
+	list_add_rcu(&zdev->iommu_list, &s390_domain->devices);
 	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
 
 	return 0;
@@ -203,17 +207,16 @@ static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
 					zdev->end_dma - zdev->start_dma + 1);
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
@@ -222,21 +225,20 @@ static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	size_t size = gather->end - gather->start + 1;
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
 	/* If nothing was added to the gather there is nothing to flush */
 	if (gather->start == ULONG_MAX)
 		return;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
 					size);
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
@@ -244,11 +246,10 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
-	unsigned long flags;
 	int rc;
 
-	spin_lock_irqsave(&s390_domain->list_lock, flags);
-	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		if (!zdev->tlb_refresh)
 			continue;
 		rc = zpci_refresh_trans((u64)zdev->fh << 32,
@@ -256,7 +257,7 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 		if (rc)
 			break;
 	}
-	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
+	rcu_read_unlock();
 }
 
 static int s390_iommu_update_trans(struct s390_domain *s390_domain,
-- 
2.34.1



* [PATCH 4/5] iommu/s390: Optimize IOMMU table walking
  2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
                   ` (2 preceding siblings ...)
  2022-10-18 14:51 ` [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration Niklas Schnelle
@ 2022-10-18 14:51 ` Niklas Schnelle
  2022-10-18 14:51 ` [PATCH 5/5] s390/pci: use lock-free I/O translation updates Niklas Schnelle
  4 siblings, 0 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

When invalidating existing table entries for unmap, there is no need
to know the physical address beforehand, so don't do an extra walk of
the IOMMU table to get it. Also, when invalidating entries, not finding
an entry indicates an invalid unmap rather than a lack of memory, so we
don't need to undo updates in this case. Implement this by splitting
s390_iommu_update_trans() into a variant for validating and one for
invalidating translations.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/s390-iommu.c | 69 ++++++++++++++++++++++++--------------
 1 file changed, 43 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 4e90987be387..efd258e3f74b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -260,14 +260,14 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	rcu_read_unlock();
 }
 
-static int s390_iommu_update_trans(struct s390_domain *s390_domain,
-				   phys_addr_t pa, dma_addr_t dma_addr,
-				   unsigned long nr_pages, int flags)
+static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
+				     phys_addr_t pa, dma_addr_t dma_addr,
+				     unsigned long nr_pages, int flags)
 {
 	phys_addr_t page_addr = pa & PAGE_MASK;
 	unsigned long irq_flags, i;
 	unsigned long *entry;
-	int rc = 0;
+	int rc;
 
 	if (!nr_pages)
 		return 0;
@@ -275,7 +275,7 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
 	spin_lock_irqsave(&s390_domain->dma_table_lock, irq_flags);
 	for (i = 0; i < nr_pages; i++) {
 		entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
-		if (!entry) {
+		if (unlikely(!entry)) {
 			rc = -ENOMEM;
 			goto undo_cpu_trans;
 		}
@@ -283,19 +283,43 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
 		page_addr += PAGE_SIZE;
 		dma_addr += PAGE_SIZE;
 	}
+	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
+
+	return 0;
 
 undo_cpu_trans:
-	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
-		flags = ZPCI_PTE_INVALID;
-		while (i-- > 0) {
-			page_addr -= PAGE_SIZE;
-			dma_addr -= PAGE_SIZE;
-			entry = dma_walk_cpu_trans(s390_domain->dma_table,
-						   dma_addr);
-			if (!entry)
-				break;
-			dma_update_cpu_trans(entry, page_addr, flags);
+	while (i-- > 0) {
+		dma_addr -= PAGE_SIZE;
+		entry = dma_walk_cpu_trans(s390_domain->dma_table,
+					   dma_addr);
+		if (!entry)
+			break;
+		dma_update_cpu_trans(entry, 0, ZPCI_PTE_INVALID);
+	}
+	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
+
+	return rc;
+}
+
+static int s390_iommu_invalidate_trans(struct s390_domain *s390_domain,
+				       dma_addr_t dma_addr, unsigned long nr_pages)
+{
+	unsigned long irq_flags, i;
+	unsigned long *entry;
+	int rc = 0;
+
+	if (!nr_pages)
+		return 0;
+
+	spin_lock_irqsave(&s390_domain->dma_table_lock, irq_flags);
+	for (i = 0; i < nr_pages; i++) {
+		entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
+		if (unlikely(!entry)) {
+			rc = -EINVAL;
+			break;
 		}
+		dma_update_cpu_trans(entry, 0, ZPCI_PTE_INVALID);
+		dma_addr += PAGE_SIZE;
 	}
 	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
 
@@ -308,8 +332,8 @@ static int s390_iommu_map_pages(struct iommu_domain *domain,
 				int prot, gfp_t gfp, size_t *mapped)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
-	int flags = ZPCI_PTE_VALID, rc = 0;
 	size_t size = pgcount << __ffs(pgsize);
+	int flags = ZPCI_PTE_VALID, rc = 0;
 
 	if (pgsize != SZ_4K)
 		return -EINVAL;
@@ -327,8 +351,8 @@ static int s390_iommu_map_pages(struct iommu_domain *domain,
 	if (!(prot & IOMMU_WRITE))
 		flags |= ZPCI_TABLE_PROTECTED;
 
-	rc = s390_iommu_update_trans(s390_domain, paddr, iova,
-				     pgcount, flags);
+	rc = s390_iommu_validate_trans(s390_domain, paddr, iova,
+				       pgcount, flags);
 	if (!rc)
 		*mapped = size;
 
@@ -373,20 +397,13 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	size_t size = pgcount << __ffs(pgsize);
-	int flags = ZPCI_PTE_INVALID;
-	phys_addr_t paddr;
 	int rc;
 
 	if (WARN_ON(iova < s390_domain->domain.geometry.aperture_start ||
 	    (iova + size - 1) > s390_domain->domain.geometry.aperture_end))
 		return 0;
 
-	paddr = s390_iommu_iova_to_phys(domain, iova);
-	if (!paddr)
-		return 0;
-
-	rc = s390_iommu_update_trans(s390_domain, paddr, iova,
-				     pgcount, flags);
+	rc = s390_iommu_invalidate_trans(s390_domain, iova, pgcount);
 	if (rc)
 		return 0;
 
-- 
2.34.1



* [PATCH 5/5] s390/pci: use lock-free I/O translation updates
  2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
                   ` (3 preceding siblings ...)
  2022-10-18 14:51 ` [PATCH 4/5] iommu/s390: Optimize IOMMU table walking Niklas Schnelle
@ 2022-10-18 14:51 ` Niklas Schnelle
  4 siblings, 0 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-18 14:51 UTC (permalink / raw)
  To: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

I/O translation tables on s390 use 8 byte page table entries, and
tables are allocated lazily but only freed when the entire I/O
translation table is torn down. Also, each IOVA can at any time only
translate to one physical address. Furthermore, I/O table accesses by
the IOMMU hardware are cache coherent. With a bit of care we can thus
use atomic updates to manipulate the translation table without having
to use a global lock at all. This is done analogously to the existing
I/O translation table handling code used on Intel and AMD x86 systems.
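
The core of the lock-free scheme is installing a newly allocated table
with cmpxchg() against the invalid entry and, on a lost race, freeing
our copy and using the winner's instead (simplified sketch of the
pattern used in the diff below):

	old_rte = cmpxchg(rtep, ZPCI_TABLE_INVALID, rte);
	if (old_rte != ZPCI_TABLE_INVALID) {
		/* someone else was faster, use their table */
		dma_free_cpu_table(sto);
		sto = get_rt_sto(old_rte);
	}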

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/include/asm/pci.h |  1 -
 arch/s390/pci/pci_dma.c     | 74 ++++++++++++++++++++++---------------
 drivers/iommu/s390-iommu.c  | 37 +++++++------------
 3 files changed, 58 insertions(+), 54 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index e4c3e4e04d30..b248694e0024 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -157,7 +157,6 @@ struct zpci_dev {
 
 	/* DMA stuff */
 	unsigned long	*dma_table;
-	spinlock_t	dma_table_lock;
 	int		tlb_refresh;
 
 	spinlock_t	iommu_bitmap_lock;
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index dee825ee7305..ea478d11fbd1 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -63,37 +63,55 @@ static void dma_free_page_table(void *table)
 	kmem_cache_free(dma_page_table_cache, table);
 }
 
-static unsigned long *dma_get_seg_table_origin(unsigned long *entry)
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
 {
+	unsigned long old_rte, rte;
 	unsigned long *sto;
 
-	if (reg_entry_isvalid(*entry))
-		sto = get_rt_sto(*entry);
-	else {
+	rte = READ_ONCE(*rtep);
+	if (reg_entry_isvalid(rte)) {
+		sto = get_rt_sto(rte);
+	} else {
 		sto = dma_alloc_cpu_table();
 		if (!sto)
 			return NULL;
 
-		set_rt_sto(entry, virt_to_phys(sto));
-		validate_rt_entry(entry);
-		entry_clr_protected(entry);
+		set_rt_sto(&rte, virt_to_phys(sto));
+		validate_rt_entry(&rte);
+		entry_clr_protected(&rte);
+
+		old_rte = cmpxchg(rtep, ZPCI_TABLE_INVALID, rte);
+		if (old_rte != ZPCI_TABLE_INVALID) {
+			/* Someone else was faster, use theirs */
+			dma_free_cpu_table(sto);
+			sto = get_rt_sto(old_rte);
+		}
 	}
 	return sto;
 }
 
-static unsigned long *dma_get_page_table_origin(unsigned long *entry)
+static unsigned long *dma_get_page_table_origin(unsigned long *step)
 {
+	unsigned long old_ste, ste;
 	unsigned long *pto;
 
-	if (reg_entry_isvalid(*entry))
-		pto = get_st_pto(*entry);
-	else {
+	ste = READ_ONCE(*step);
+	if (reg_entry_isvalid(ste)) {
+		pto = get_st_pto(ste);
+	} else {
 		pto = dma_alloc_page_table();
 		if (!pto)
 			return NULL;
-		set_st_pto(entry, virt_to_phys(pto));
-		validate_st_entry(entry);
-		entry_clr_protected(entry);
+		set_st_pto(&ste, virt_to_phys(pto));
+		validate_st_entry(&ste);
+		entry_clr_protected(&ste);
+
+		old_ste = cmpxchg(step, ZPCI_TABLE_INVALID, ste);
+		if (old_ste != ZPCI_TABLE_INVALID) {
+			/* Someone else was faster, use theirs */
+			dma_free_page_table(pto);
+			pto = get_st_pto(old_ste);
+		}
 	}
 	return pto;
 }
@@ -117,19 +135,24 @@ unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
 	return &pto[px];
 }
 
-void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int flags)
+void dma_update_cpu_trans(unsigned long *ptep, phys_addr_t page_addr, int flags)
 {
+	unsigned long pte;
+
+	pte = READ_ONCE(*ptep);
 	if (flags & ZPCI_PTE_INVALID) {
-		invalidate_pt_entry(entry);
+		invalidate_pt_entry(&pte);
 	} else {
-		set_pt_pfaa(entry, page_addr);
-		validate_pt_entry(entry);
+		set_pt_pfaa(&pte, page_addr);
+		validate_pt_entry(&pte);
 	}
 
 	if (flags & ZPCI_TABLE_PROTECTED)
-		entry_set_protected(entry);
+		entry_set_protected(&pte);
 	else
-		entry_clr_protected(entry);
+		entry_clr_protected(&pte);
+
+	xchg(ptep, pte);
 }
 
 static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
@@ -137,18 +160,14 @@ static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
 {
 	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
 	phys_addr_t page_addr = (pa & PAGE_MASK);
-	unsigned long irq_flags;
 	unsigned long *entry;
 	int i, rc = 0;
 
 	if (!nr_pages)
 		return -EINVAL;
 
-	spin_lock_irqsave(&zdev->dma_table_lock, irq_flags);
-	if (!zdev->dma_table) {
-		rc = -EINVAL;
-		goto out_unlock;
-	}
+	if (!zdev->dma_table)
+		return -EINVAL;
 
 	for (i = 0; i < nr_pages; i++) {
 		entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
@@ -173,8 +192,6 @@ static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
 			dma_update_cpu_trans(entry, page_addr, flags);
 		}
 	}
-out_unlock:
-	spin_unlock_irqrestore(&zdev->dma_table_lock, irq_flags);
 	return rc;
 }
 
@@ -558,7 +575,6 @@ int zpci_dma_init_device(struct zpci_dev *zdev)
 	WARN_ON(zdev->s390_domain);
 
 	spin_lock_init(&zdev->iommu_bitmap_lock);
-	spin_lock_init(&zdev->dma_table_lock);
 
 	zdev->dma_table = dma_alloc_cpu_table();
 	if (!zdev->dma_table) {
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index efd258e3f74b..17738c9f24ef 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -20,7 +20,6 @@ struct s390_domain {
 	struct iommu_domain	domain;
 	struct list_head	devices;
 	unsigned long		*dma_table;
-	spinlock_t		dma_table_lock;
 	spinlock_t		list_lock;
 };
 
@@ -61,7 +60,6 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 	s390_domain->domain.geometry.aperture_start = 0;
 	s390_domain->domain.geometry.aperture_end = ZPCI_TABLE_SIZE_RT - 1;
 
-	spin_lock_init(&s390_domain->dma_table_lock);
 	spin_lock_init(&s390_domain->list_lock);
 	INIT_LIST_HEAD_RCU(&s390_domain->devices);
 
@@ -265,14 +263,10 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
 				     unsigned long nr_pages, int flags)
 {
 	phys_addr_t page_addr = pa & PAGE_MASK;
-	unsigned long irq_flags, i;
 	unsigned long *entry;
+	unsigned long i;
 	int rc;
 
-	if (!nr_pages)
-		return 0;
-
-	spin_lock_irqsave(&s390_domain->dma_table_lock, irq_flags);
 	for (i = 0; i < nr_pages; i++) {
 		entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
 		if (unlikely(!entry)) {
@@ -283,7 +277,6 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
 		page_addr += PAGE_SIZE;
 		dma_addr += PAGE_SIZE;
 	}
-	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
 
 	return 0;
 
@@ -296,7 +289,6 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
 			break;
 		dma_update_cpu_trans(entry, 0, ZPCI_PTE_INVALID);
 	}
-	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
 
 	return rc;
 }
@@ -304,14 +296,10 @@ static int s390_iommu_validate_trans(struct s390_domain *s390_domain,
 static int s390_iommu_invalidate_trans(struct s390_domain *s390_domain,
 				       dma_addr_t dma_addr, unsigned long nr_pages)
 {
-	unsigned long irq_flags, i;
 	unsigned long *entry;
+	unsigned long i;
 	int rc = 0;
 
-	if (!nr_pages)
-		return 0;
-
-	spin_lock_irqsave(&s390_domain->dma_table_lock, irq_flags);
 	for (i = 0; i < nr_pages; i++) {
 		entry = dma_walk_cpu_trans(s390_domain->dma_table, dma_addr);
 		if (unlikely(!entry)) {
@@ -321,7 +309,6 @@ static int s390_iommu_invalidate_trans(struct s390_domain *s390_domain,
 		dma_update_cpu_trans(entry, 0, ZPCI_PTE_INVALID);
 		dma_addr += PAGE_SIZE;
 	}
-	spin_unlock_irqrestore(&s390_domain->dma_table_lock, irq_flags);
 
 	return rc;
 }
@@ -363,7 +350,8 @@ static phys_addr_t s390_iommu_iova_to_phys(struct iommu_domain *domain,
 					   dma_addr_t iova)
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
-	unsigned long *sto, *pto, *rto, flags;
+	unsigned long *rto, *sto, *pto;
+	unsigned long ste, pte, rte;
 	unsigned int rtx, sx, px;
 	phys_addr_t phys = 0;
 
@@ -376,16 +364,17 @@ static phys_addr_t s390_iommu_iova_to_phys(struct iommu_domain *domain,
 	px = calc_px(iova);
 	rto = s390_domain->dma_table;
 
-	spin_lock_irqsave(&s390_domain->dma_table_lock, flags);
-	if (rto && reg_entry_isvalid(rto[rtx])) {
-		sto = get_rt_sto(rto[rtx]);
-		if (sto && reg_entry_isvalid(sto[sx])) {
-			pto = get_st_pto(sto[sx]);
-			if (pto && pt_entry_isvalid(pto[px]))
-				phys = pto[px] & ZPCI_PTE_ADDR_MASK;
+	rte = READ_ONCE(rto[rtx]);
+	if (reg_entry_isvalid(rte)) {
+		sto = get_rt_sto(rte);
+		ste = READ_ONCE(sto[sx]);
+		if (reg_entry_isvalid(ste)) {
+			pto = get_st_pto(ste);
+			pte = READ_ONCE(pto[px]);
+			if (pt_entry_isvalid(pte))
+				phys = pte & ZPCI_PTE_ADDR_MASK;
 		}
 	}
-	spin_unlock_irqrestore(&s390_domain->dma_table_lock, flags);
 
 	return phys;
 }
-- 
2.34.1



* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-18 14:51 ` [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration Niklas Schnelle
@ 2022-10-18 15:18   ` Jason Gunthorpe
  2022-10-19  8:31     ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-18 15:18 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:

> @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
>  		return;
>  
>  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> -	list_del_init(&zdev->iommu_list);
> +	list_del_rcu(&zdev->iommu_list);
>  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

This doesn't seem obviously OK, the next steps remove the translation
while we can still have concurrent RCU protected flushes going on.

Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?

Jason


* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-18 15:18   ` Jason Gunthorpe
@ 2022-10-19  8:31     ` Niklas Schnelle
  2022-10-19 11:53       ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-19  8:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> 
> > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> >  		return;
> >  
> >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > -	list_del_init(&zdev->iommu_list);
> > +	list_del_rcu(&zdev->iommu_list);
> >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> 
> This doesn't seem obviously OK, the next steps remove the translation
> while we can still have concurrent RCU protected flushes going on.
> 
> Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> 
> Jason

Interesting point. So for the flushes themselves this should be fine,
once the zpci_unregister_ioat() is executed all subsequent and ongoing
IOTLB flushes should return an error code without further adverse
effects. Though I think we do still have an issue in the IOTLB ops for
this case as that error would skip the IOTLB flushes of other attached
devices.

The bigger question and that seems independent from RCU is how/if
detach is supposed to work if there are still DMAs ongoing. Once we do
the zpci_unregister_ioat() any DMA request coming from the PCI device
will be blocked and will lead to the PCI device being isolated (put
into an error state) for attempting an invalid DMA. So I had expected
that calls of detach/attach would happen without expected ongoing DMAs
and thus IOTLB flushes? Of course we should be robust against
violations of that and unexpected DMAs for which I think isolating the
PCI device is the correct response. What am I missing?



* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-19  8:31     ` Niklas Schnelle
@ 2022-10-19 11:53       ` Jason Gunthorpe
  2022-10-20  8:51         ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-19 11:53 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Wed, Oct 19, 2022 at 10:31:21AM +0200, Niklas Schnelle wrote:
> On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> > 
> > > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> > >  		return;
> > >  
> > >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > -	list_del_init(&zdev->iommu_list);
> > > +	list_del_rcu(&zdev->iommu_list);
> > >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > 
> > This doesn't seem obviously OK, the next steps remove the translation
> > while we can still have concurrent RCU protected flushes going on.
> > 
> > Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> > 
> > Jason
> 
> Interesting point. So for the flushes themselves this should be fine,
> once the zpci_unregister_ioat() is executed all subsequent and ongoing
> IOTLB flushes should return an error code without further adverse
> effects. Though I think we do still have an issue in the IOTLB ops for
> this case as that error would skip the IOTLB flushes of other attached
> devices.

That sounds bad


> The bigger question and that seems independent from RCU is how/if
> detach is supposed to work if there are still DMAs ongoing. Once we do
> the zpci_unregister_ioat() any DMA request coming from the PCI device
> will be blocked and will lead to the PCI device being isolated (put
> into an error state) for attempting an invalid DMA. So I had expected
> that calls of detach/attach would happen without expected ongoing DMAs
> and thus IOTLB flushes? 

"ongoing DMA" from this device shouuld be stopped, it doesn't mean
that the other devices attached to the same domain are not also still
operating and also still having flushes. So now that it is RCU a flush
triggered by a different device will continue to see this now disabled
device and try to flush it until the grace period.

Jason


* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-19 11:53       ` Jason Gunthorpe
@ 2022-10-20  8:51         ` Niklas Schnelle
  2022-10-20 11:05           ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-20  8:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Wed, 2022-10-19 at 08:53 -0300, Jason Gunthorpe wrote:
> On Wed, Oct 19, 2022 at 10:31:21AM +0200, Niklas Schnelle wrote:
> > On Tue, 2022-10-18 at 12:18 -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 18, 2022 at 04:51:30PM +0200, Niklas Schnelle wrote:
> > > 
> > > > @@ -84,7 +88,7 @@ static void __s390_iommu_detach_device(struct zpci_dev *zdev)
> > > >  		return;
> > > >  
> > > >  	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > > -	list_del_init(&zdev->iommu_list);
> > > > +	list_del_rcu(&zdev->iommu_list);
> > > >  	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > > 
> > > This doesn't seem obviously OK, the next steps remove the translation
> > > while we can still have concurrent RCU protected flushes going on.
> > > 
> > > Is it OK to call the flushes when after the zpci_dma_exit_device()/etc?
> > > 
> > > Jason
> > 
> > Interesting point. So for the flushes themselves this should be fine,
> > once the zpci_unregister_ioat() is executed all subsequent and ongoing
> > IOTLB flushes should return an error code without further adverse
> > effects. Though I think we do still have an issue in the IOTLB ops for
> > this case as that error would skip the IOTLB flushes of other attached
> > devices.
> 
> That sounds bad

Thankfully it's very easy to fix: since our IOTLB flushes are per PCI
function, I just need to continue the loop in the IOTLB ops on error
instead of breaking out of it and skipping the other devices. It makes
no sense anyway to skip devices just because there is an error on
another device.
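
For illustration, s390_iommu_iotlb_sync_map() would then look roughly
like this (untested sketch):

	rcu_read_lock();
	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
		if (!zdev->tlb_refresh)
			continue;
		/* on error just keep flushing the remaining devices */
		zpci_refresh_trans((u64)zdev->fh << 32, iova, size);
	}
	rcu_read_unlock();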

> 
> 
> > The bigger question and that seems independent from RCU is how/if
> > detach is supposed to work if there are still DMAs ongoing. Once we do
> > the zpci_unregister_ioat() any DMA request coming from the PCI device
> > will be blocked and will lead to the PCI device being isolated (put
> > into an error state) for attempting an invalid DMA. So I had expected
> > that calls of detach/attach would happen without expected ongoing DMAs
> > and thus IOTLB flushes? 
> 
> "ongoing DMA" from this device should be stopped, it doesn't mean
> that the other devices attached to the same domain are not also still
> operating and also still having flushes. So now that it is RCU a flush
> triggered by a different device will continue to see this now disabled
> device and try to flush it until the grace period.
> 
> Jason

Ok that makes sense thanks for the explanation. So yes my assessment is
still that in this situation the IOTLB flush is architected to return
an error that we can ignore. Not the most elegant I admit but at least
it's simple. Alternatively I guess we could use call_rcu() to do the
zpci_unregister_ioat() but I'm not sure how to then make sure that a
subsequent zpci_register_ioat() only happens after that without adding
too much more logic.



* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-20  8:51         ` Niklas Schnelle
@ 2022-10-20 11:05           ` Jason Gunthorpe
  2022-10-21 12:08             ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-20 11:05 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:

> Ok that makes sense thanks for the explanation. So yes my assessment is
> still that in this situation the IOTLB flush is architected to return
> an error that we can ignore. Not the most elegant I admit but at least
> it's simple. Alternatively I guess we could use call_rcu() to do the
> zpci_unregister_ioat() but I'm not sure how to then make sure that a
> subsequent zpci_register_ioat() only happens after that without adding
> too much more logic.

This won't work either as the domain could have been freed before the
call_rcu() happens, the domain needs to be detached synchronously

Jason


* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-20 11:05           ` Jason Gunthorpe
@ 2022-10-21 12:08             ` Niklas Schnelle
  2022-10-21 13:36               ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-21 12:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> 
> > Ok that makes sense thanks for the explanation. So yes my assessment is
> > still that in this situation the IOTLB flush is architected to return
> > an error that we can ignore. Not the most elegant I admit but at least
> > it's simple. Alternatively I guess we could use call_rcu() to do the
> > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > subsequent zpci_register_ioat() only happens after that without adding
> > too much more logic.
> 
> This won't work either as the domain could have been freed before the
> call_rcu() happens, the domain needs to be detached synchronously
> 
> Jason

Yeah right, that is basically the same issue I was thinking of for a
subsequent zpci_register_ioat(). What about the obvious one. Just call
synchronize_rcu() before zpci_unregister_ioat()?
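
i.e. a minimal sketch of that ordering in __s390_iommu_detach_device()
(hypothetical, not what this patch currently does):

	spin_lock_irqsave(&s390_domain->list_lock, flags);
	list_del_rcu(&zdev->iommu_list);
	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

	/* wait until no concurrent list walker can still see zdev */
	synchronize_rcu();

	zpci_unregister_ioat(zdev, 0);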



* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-21 12:08             ` Niklas Schnelle
@ 2022-10-21 13:36               ` Jason Gunthorpe
  2022-10-21 15:01                 ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-21 13:36 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > 
> > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > still that in this situation the IOTLB flush is architected to return
> > > an error that we can ignore. Not the most elegant I admit but at least
> > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > subsequent zpci_register_ioat() only happens after that without adding
> > > too much more logic.
> > 
> > This won't work either as the domain could have been freed before the
> > call_rcu() happens, the domain needs to be detached synchronously
> > 
> > Jason
> 
> Yeah right, that is basically the same issue I was thinking of for a
> subsequent zpci_register_ioat(). What about the obvious one. Just call
> synchronize_rcu() before zpci_unregister_ioat()?

Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
to complete in some cases.

What you have seems like it could be OK, just deal with the ugly racy
failure

Jason


* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-21 13:36               ` Jason Gunthorpe
@ 2022-10-21 15:01                 ` Niklas Schnelle
  2022-10-21 15:04                   ` Jason Gunthorpe
  2022-10-21 15:05                   ` Niklas Schnelle
  0 siblings, 2 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-21 15:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > 
> > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > still that in this situation the IOTLB flush is architected to return
> > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > too much more logic.
> > > 
> > > This won't work either as the domain could have been freed before the
> > > call_rcu() happens, the domain needs to be detached synchronously
> > > 
> > > Jason
> > 
> > Yeah right, that is basically the same issue I was thinking of for a
> > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > synchronize_rcu() before zpci_unregister_ioat()?
> 
> Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> to complete in some cases.
> 
> What you have seems like it could be OK, just deal with the ugly racy
> failure
> 
> Jason

I'd tend to go with synchronize_rcu(). It won't leave us with spurious
error logs for the failed IOTLB flushes and as you said one expects
detach to be synchronous. I don't think waiting in it will be a
problem. But this is definitely something you're more of an expert on
so I'll trust your judgement. Looking at other callers of
synchronize_rcu() quite a few of them look to be in similar
detach/release kind of situations though not sure how frequent and
performance critical IOMMU domain detaching is in comparison.

Thanks,
Niklas



* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-21 15:01                 ` Niklas Schnelle
@ 2022-10-21 15:04                   ` Jason Gunthorpe
  2022-10-24 15:22                     ` Niklas Schnelle
  2022-10-21 15:05                   ` Niklas Schnelle
  1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-21 15:04 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, Oct 21, 2022 at 05:01:32PM +0200, Niklas Schnelle wrote:
> On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > still that in this situation the IOTLB flush is architected to return
> > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > too much more logic.
> > > > 
> > > > This won't work either as the domain could have been freed before the
> > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > 
> > > > Jason
> > > 
> > > Yeah right, that is basically the same issue I was thinking of for a
> > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > synchronize_rcu() before zpci_unregister_ioat()?
> > 
> > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > to complete in some cases.
> > 
> > What you have seems like it could be OK, just deal with the ugly racy
> > failure
> > 
> > Jason
> 
> I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> error logs for the failed IOTLB flushes and as you said one expects
> detach to be synchronous. I don't think waiting in it will be a
> problem. But this is definitely something you're more of an expert on
> so I'll trust your judgement. Looking at other callers of
> synchronize_rcu() quite a few of them look to be in similar
> detach/release kind of situations though not sure how frequent and
> performance critical IOMMU domain detaching is in comparison.

I would not do it on domain detaching, that is something triggered by
userspace through VFIO and it could theoretically happen a lot, e.g. in
vIOMMU scenarios.

Jason


* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-21 15:01                 ` Niklas Schnelle
  2022-10-21 15:04                   ` Jason Gunthorpe
@ 2022-10-21 15:05                   ` Niklas Schnelle
  1 sibling, 0 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-21 15:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, 2022-10-21 at 17:01 +0200, Niklas Schnelle wrote:
> On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > still that in this situation the IOTLB flush is architected to return
> > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > too much more logic.
> > > > 
> > > > This won't work either as the domain could have been freed before the
> > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > 
> > > > Jason
> > > 
> > > Yeah right, that is basically the same issue I was thinking of for a
> > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > synchronize_rcu() before zpci_unregister_ioat()?
> > 
> > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > to complete in some cases.
> > 
> > What you have seems like it could be OK, just deal with the ugly racy
> > failure
> > 
> > Jason
> 
> I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> error logs for the failed IOTLB flushes and as you said one expects
> detach to be synchronous. I don't think waiting in it will be a
> problem. But this is definitely something you're more of an expert on
> so I'll trust your judgement. Looking at other callers of
> synchronize_rcu() quite a few of them look to be in similar
> detach/release kind of situations though not sure how frequent and
> performance critical IOMMU domain detaching is in comparison.
> 
> Thanks,
> Niklas
> 

Addendum: of course, independently of whether we use synchronize_rcu(),
I'll change the error handling in the IOTLB ops to not skip over the
other devices.
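
As a rough sketch (the ret/rc error bookkeeping is made up here, the
loop itself is the one from the patch):

	spin_lock_irqsave(&s390_domain->list_lock, flags);
	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
		/* ret is illustrative: record the error, keep flushing */
		ret = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
					 zdev->end_dma - zdev->start_dma + 1);
		if (ret && !rc)
			rc = ret;
	}
	spin_unlock_irqrestore(&s390_domain->list_lock, flags);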


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-21 15:04                   ` Jason Gunthorpe
@ 2022-10-24 15:22                     ` Niklas Schnelle
  2022-10-24 16:26                       ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-24 15:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, 2022-10-21 at 12:04 -0300, Jason Gunthorpe wrote:
> On Fri, Oct 21, 2022 at 05:01:32PM +0200, Niklas Schnelle wrote:
> > On Fri, 2022-10-21 at 10:36 -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 21, 2022 at 02:08:02PM +0200, Niklas Schnelle wrote:
> > > > On Thu, 2022-10-20 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Oct 20, 2022 at 10:51:10AM +0200, Niklas Schnelle wrote:
> > > > > 
> > > > > > Ok that makes sense thanks for the explanation. So yes my assessment is
> > > > > > still that in this situation the IOTLB flush is architected to return
> > > > > > an error that we can ignore. Not the most elegant I admit but at least
> > > > > > it's simple. Alternatively I guess we could use call_rcu() to do the
> > > > > > zpci_unregister_ioat() but I'm not sure how to then make sure that a
> > > > > > subsequent zpci_register_ioat() only happens after that without adding
> > > > > > too much more logic.
> > > > > 
> > > > > This won't work either as the domain could have been freed before the
> > > > > call_rcu() happens, the domain needs to be detached synchronously
> > > > > 
> > > > > Jason
> > > > 
> > > > Yeah right, that is basically the same issue I was thinking of for a
> > > > subsequent zpci_register_ioat(). What about the obvious one. Just call
> > > > synchronize_rcu() before zpci_unregister_ioat()?
> > > 
> > > Ah, it can be done, but be prepared to wait >> 1s for synchronize_rcu
> > > to complete in some cases.
> > > 
> > > What you have seems like it could be OK, just deal with the ugly racy
> > > failure
> > > 
> > > Jason
> > 
> > I'd tend to go with synchronize_rcu(). It won't leave us with spurious
> > error logs for the failed IOTLB flushes and as you said one expects
> > detach to be synchronous. I don't think waiting in it will be a
> > problem. But this is definitely something you're more of an expert on
> > so I'll trust your judgement. Looking at other callers of
> > synchronize_rcu() quite a few of them look to be in similar
> > detach/release kind of situations though not sure how frequent and
> > performance critical IOMMU domain detaching is in comparison.
> 
> I would not do it on domain detaching, that is something triggered by
> userspace through VFIO and it could theoretically happen a lot, eg in
> vIOMMU scenarios.
> 
> Jason

Thanks for the explanation, still would like to grok this a bit more if
you don't mind. If I do read things correctly synchronize_rcu() should
run in the context of the VFIO ioctl in this case and shouldn't block
anything else in the kernel, correct? At least that's how I understand
the synchronize_rcu() comments and the fact that e.g.
net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
synchronize_rcu() and can be triggered from user-space too.

So we're
more worried about user-space getting slowed down rather than a Denial-
of-Service against other kernel tasks.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-24 15:22                     ` Niklas Schnelle
@ 2022-10-24 16:26                       ` Jason Gunthorpe
  2022-10-27 12:44                         ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-24 16:26 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:

> Thanks for the explanation, still would like to grok this a bit more if
> you don't mind. If I do read things correctly synchronize_rcu() should
> run in the context of the VFIO ioctl in this case and shouldn't block
> anything else in the kernel, correct? At least that's how I understand
> the synchronize_rcu() comments and the fact that e.g.
> net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> synchronize_rcu() and can be triggered from user-space too.

Yes, but I wouldn't look in the kernel to understand if things are OK
 
> So we're
> more worried about user-space getting slowed down rather than a Denial-
> of-Service against other kernel tasks.

Yes, functionally it is OK, but for something like vfio with vIOMMU
you could be looking at several domains that have to be detached
sequentially and with grace periods > 1s you can reach multiple
seconds to complete something like a close() system call. Generally it
should be weighed carefully

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-24 16:26                       ` Jason Gunthorpe
@ 2022-10-27 12:44                         ` Niklas Schnelle
  2022-10-27 12:56                           ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-27 12:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> 
> > Thanks for the explanation, still would like to grok this a bit more if
> > you don't mind. If I do read things correctly synchronize_rcu() should
> > run in the context of the VFIO ioctl in this case and shouldn't block
> > anything else in the kernel, correct? At least that's how I understand
> > the synchronize_rcu() comments and the fact that e.g.
> > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > synchronize_rcu() and can be triggered from user-space too.
> 
> Yes, but I wouldn't look in the kernel to understand if things are OK
>  
> > So we're
> > more worried about user-space getting slowed down rather than a Denial-
> > of-Service against other kernel tasks.
> 
> Yes, functionally it is OK, but for something like vfio with vIOMMU
> you could be looking at several domains that have to be detached
> sequentially and with grace periods > 1s you can reach multiple
> seconds to complete something like a close() system call. Generally it
> should be weighed carefully
> 
> Jason

Thanks for the detailed explanation. Then let's not put a
synchronize_rcu() in detach, as I said as long as the I/O translation
tables are there an IOTLB flush after zpci_unregister_ioat() should
result in an ignorable error. That said, I think if we don't have the
synchronize_rcu() in detach we need it in s390_domain_free() before
freeing the I/O translation tables.
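
Just as a sketch of what I mean, assuming s390_domain_free() otherwise
stays as it is today:

	static void s390_domain_free(struct iommu_domain *domain)
	{
		struct s390_domain *s390_domain = to_s390_domain(domain);

		/* wait for concurrent domain_list readers to finish */
		synchronize_rcu();
		dma_cleanup_tables(s390_domain->dma_table);
		kfree(s390_domain);
	}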


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-27 12:44                         ` Niklas Schnelle
@ 2022-10-27 12:56                           ` Jason Gunthorpe
  2022-10-27 13:35                             ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-27 12:56 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > 
> > > Thanks for the explanation, still would like to grok this a bit more if
> > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > run in the context of the VFIO ioctl in this case and shouldn't block
> > > anything else in the kernel, correct? At least that's how I understand
> > > the synchronize_rcu() comments and the fact that e.g.
> > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > synchronize_rcu() and can be triggered from user-space too.
> > 
> > Yes, but I wouldn't look in the kernel to understand if things are OK
> >  
> > > So we're
> > > more worried about user-space getting slowed down rather than a Denial-
> > > of-Service against other kernel tasks.
> > 
> > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > you could be looking at several domains that have to be detached
> > sequentially and with grace periods > 1s you can reach multiple
> > seconds to complete something like a close() system call. Generally it
> > should be weighed carefully
> > 
> > Jason
> 
> Thanks for the detailed explanation. Then let's not put a
> synchronize_rcu() in detach, as I said as long as the I/O translation
> tables are there an IOTLB flush after zpci_unregister_ioat() should
> result in an ignorable error. That said, I think if we don't have the
> synchronize_rcu() in detach we need it in s390_domain_free() before
> freeing the I/O translation tables.

Yes, it would be appropriate to free those using one of the rcu
free'rs, (eg kfree_rcu) not synchronize_rcu()
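
E.g., as a sketch, assuming the freed object embeds a struct rcu_head
member named rcu:

	kfree_rcu(obj, rcu);	/* obj/rcu are placeholder names */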

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-27 12:56                           ` Jason Gunthorpe
@ 2022-10-27 13:35                             ` Niklas Schnelle
  2022-10-27 14:03                               ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-27 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > 
> > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > run in the context of the VFIO ioctl in this case and shouldn't block
> > > > anything else in the kernel, correct? At least that's how I understand
> > > > the synchronize_rcu() comments and the fact that e.g.
> > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > synchronize_rcu() and can be triggered from user-space too.
> > > 
> > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > >  
> > > > So we're
> > > > more worried about user-space getting slowed down rather than a Denial-
> > > > of-Service against other kernel tasks.
> > > 
> > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > you could be looking at several domains that have to be detached
> > > sequentially and with grace periods > 1s you can reach multiple
> > > seconds to complete something like a close() system call. Generally it
> > > should be weighed carefully
> > > 
> > > Jason
> > 
> > Thanks for the detailed explanation. Then let's not put a
> > synchronize_rcu() in detach, as I said as long as the I/O translation
> > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > result in an ignorable error. That said, I think if we don't have the
> > synchronize_rcu() in detach we need it in s390_domain_free() before
> > freeing the I/O translation tables.
> 
> Yes, it would be appropriate to free those using one of the rcu
> free'rs, (eg kfree_rcu) not synchronize_rcu()
> 
> Jason

They are allocated via kmem_cache_alloc() from caches shared by all
IOMMUs so we can't use kfree_rcu() directly. Also we're only freeing the
entire I/O translation table of one IOMMU at once after it is not used
anymore. Before that it is only grown. So I think synchronize_rcu() is
the obvious and simple choice since we only need one grace period.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-27 13:35                             ` Niklas Schnelle
@ 2022-10-27 14:03                               ` Jason Gunthorpe
  2022-10-28  9:29                                 ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-27 14:03 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, Oct 27, 2022 at 03:35:57PM +0200, Niklas Schnelle wrote:
> On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > > 
> > > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > > run in the context of the VFIO ioctl in this case and shouldn't block
> > > > > anything else in the kernel, correct? At least that's how I understand
> > > > > the synchronize_rcu() comments and the fact that e.g.
> > > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > > synchronize_rcu() and can be triggered from user-space too.
> > > > 
> > > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > > >  
> > > > > So we're
> > > > > more worried about user-space getting slowed down rather than a Denial-
> > > > > of-Service against other kernel tasks.
> > > > 
> > > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > > you could be looking at several domains that have to be detached
> > > > sequentially and with grace periods > 1s you can reach multiple
> > > > seconds to complete something like a close() system call. Generally it
> > > > should be weighed carefully
> > > > 
> > > > Jason
> > > 
> > > Thanks for the detailed explanation. Then let's not put a
> > > synchronize_rcu() in detach, as I said as long as the I/O translation
> > > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > > result in an ignorable error. That said, I think if we don't have the
> > > synchronize_rcu() in detach we need it in s390_domain_free() before
> > > freeing the I/O translation tables.
> > 
> > Yes, it would be appropriate to free those using one of the rcu
> > free'rs, (eg kfree_rcu) not synchronize_rcu()
> > 
> > Jason
> 
> They are allocated via kmem_cache_alloc() from caches shared by all
> IOMMUs so we can't use kfree_rcu() directly. Also we're only freeing the
> entire I/O translation table of one IOMMU at once after it is not used
> anymore. Before that it is only grown. So I think synchronize_rcu() is
> the obvious and simple choice since we only need one grace period.

It has the same issue as doing it for the other reason, adding
synchronize_rcu() to the domain free path is undesirable.

The best thing is to do as kfree_rcu() does now, basically:

rcu_head = kzalloc(sizeof(*rcu_head), GFP_NOWAIT | __GFP_NOWARN);
if (!rcu_head)
   synchronize_rcu();
else
   call_rcu(rcu_head, free_cb);

And then call kmem_cache_free() from the rcu callback
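
As a sketch (the wrapper struct and all names are made up, since a bare
rcu_head on its own can't find the object to free; the kzalloc() above
would then allocate the wrapper):

	/* hypothetical wrapper tying the rcu_head to the table to free */
	struct deferred_free {
		struct rcu_head rcu;
		void *table;
	};

	static void free_cb(struct rcu_head *head)
	{
		struct deferred_free *df =
			container_of(head, struct deferred_free, rcu);

		/* table_cache stands in for the real kmem_cache */
		kmem_cache_free(table_cache, df->table);
		kfree(df);
	}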

But this is getting very complicated, you might be better to refcount
the domain itself and acquire the refcount under RCU. This turns the
locking problem into a per-domain-object lock instead of a global lock
which is usually good enough and simpler to understand.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-27 14:03                               ` Jason Gunthorpe
@ 2022-10-28  9:29                                 ` Niklas Schnelle
  2022-10-28 11:28                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Niklas Schnelle @ 2022-10-28  9:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Thu, 2022-10-27 at 11:03 -0300, Jason Gunthorpe wrote:
> On Thu, Oct 27, 2022 at 03:35:57PM +0200, Niklas Schnelle wrote:
> > On Thu, 2022-10-27 at 09:56 -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 27, 2022 at 02:44:49PM +0200, Niklas Schnelle wrote:
> > > > On Mon, 2022-10-24 at 13:26 -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Oct 24, 2022 at 05:22:24PM +0200, Niklas Schnelle wrote:
> > > > > 
> > > > > > Thanks for the explanation, still would like to grok this a bit more if
> > > > > > you don't mind. If I do read things correctly synchronize_rcu() should
> > > > > > run in the context of the VFIO ioctl in this case and shouldn't block
> > > > > > anything else in the kernel, correct? At least that's how I understand
> > > > > > the synchronize_rcu() comments and the fact that e.g.
> > > > > > net/vmw_vsock/virtio_transport.c:virtio_vsock_remove() also does a
> > > > > > synchronize_rcu() and can be triggered from user-space too.
> > > > > 
> > > > > Yes, but I wouldn't look in the kernel to understand if things are OK
> > > > >  
> > > > > > So we're
> > > > > > more worried about user-space getting slowed down rather than a Denial-
> > > > > > of-Service against other kernel tasks.
> > > > > 
> > > > > Yes, functionally it is OK, but for something like vfio with vIOMMU
> > > > > you could be looking at several domains that have to be detached
> > > > > sequentially and with grace periods > 1s you can reach multiple
> > > > > seconds to complete something like a close() system call. Generally it
> > > > > should be weighed carefully
> > > > > 
> > > > > Jason
> > > > 
> > > > Thanks for the detailed explanation. Then let's not put a
> > > > synchronize_rcu() in detach, as I said as long as the I/O translation
> > > > tables are there an IOTLB flush after zpci_unregister_ioat() should
> > > > result in an ignorable error. That said, I think if we don't have the
> > > > synchronize_rcu() in detach we need it in s390_domain_free() before
> > > > freeing the I/O translation tables.
> > > 
> > > Yes, it would be appropriate to free those using one of the rcu
> > > free'rs, (eg kfree_rcu) not synchronize_rcu()
> > > 
> > > Jason
> > 
> > They are allocated via kmem_cache_alloc() from caches shared by all
> > IOMMUs so we can't use kfree_rcu() directly. Also we're only freeing the
> > entire I/O translation table of one IOMMU at once after it is not used
> > anymore. Before that it is only grown. So I think synchronize_rcu() is
> > the obvious and simple choice since we only need one grace period.
> 
> It has the same issue as doing it for the other reason, adding
> synchronize_rcu() to the domain free path is undesirable.
> 
> The best thing is to do as kfree_rcu() does now, basically:
> 
> rcu_head = kzalloc(sizeof(*rcu_head), GFP_NOWAIT | __GFP_NOWARN);
> if (!rcu_head)
>    synchronize_rcu();
> else
>    call_rcu(rcu_head, free_cb);
> 
> And then call kmem_cache_free() from the rcu callback

Hmm, maybe a stupid question but why can't I just put the rcu_head in
struct s390_domain and then do a call_rcu() on that with a callback
that does:

	dma_cleanup_tables(s390_domain->dma_table);
	kfree(s390_domain);

I.e. the rest of the current s390_domain_free().
Then I don't have to worry about failing to allocate the rcu_head and
it's simple enough. Basically just do the actual freeing of the
s390_domain via call_rcu().
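
As a sketch, assuming a struct rcu_head rcu member is added to struct
s390_domain (callback name made up):

	static void s390_domain_free_rcu(struct rcu_head *rcu)
	{
		struct s390_domain *s390_domain =
			container_of(rcu, struct s390_domain, rcu);

		dma_cleanup_tables(s390_domain->dma_table);
		kfree(s390_domain);
	}

	static void s390_domain_free(struct iommu_domain *domain)
	{
		call_rcu(&to_s390_domain(domain)->rcu, s390_domain_free_rcu);
	}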

> 
> But this is getting very complicated, you might be better to refcount
> the domain itself and acquire the refcount under RCU. This turns the
> locking problem into a per-domain-object lock instead of a global lock
> which is usually good enough and simpler to understand.
> 
> Jason

Sorry I might be a bit slow as I'm new to RCU but I don't understand
this yet, especially the last part. Before this patch we do have a per-
domain lock but I'm sure that's not the kind of "per-domain-object
lock" you're talking about or else we wouldn't need RCU at all. Is this
maybe a different way of expressing the above idea using the analogy
with reference counting from whatisRCU.rst? Meaning we treat the fact
that there may still be RCU readers as "there are still references to
s390_domain"? 

Or do you mean to use a kref that is taken by RCU readers together with
rcu_read_lock() and dropped at rcu_read_unlock() such that during the
RCU read critical sections the refcount can't fall below 1 and the
domain is actually freed once we have a) put the initial reference
during s390_domain_free() and b) put all temporary references on
exiting the RCU read critical sections?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration
  2022-10-28  9:29                                 ` Niklas Schnelle
@ 2022-10-28 11:28                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2022-10-28 11:28 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Matthew Rosato, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Fri, Oct 28, 2022 at 11:29:00AM +0200, Niklas Schnelle wrote:

> > rcu_head = kzalloc(sizeof(*rcu_head), GFP_NOWAIT | __GFP_NOWARN);
> > if (!rcu_head)
> >    synchronize_rcu();
> > else
> >    call_rcu(rcu_head, free_cb);
> > 
> > And then call kmem_cache_free() from the rcu callback
> 
> Hmm, maybe a stupid question but why can't I just put the rcu_head in
> struct s390_domain and then do a call_rcu() on that with a callback
> that does:
> 
> 	dma_cleanup_tables(s390_domain->dma_table);
> 	kfree(s390_domain);
> 
> I.e. the rest of the current s390_domain_free().
> Then I don't have to worry about failing to allocate the rcu_head and
> it's simple enough. Basically just do the actual freeing of the
> s390_domain via call_rcu().

Oh, if you never reallocate the dma_table then yes that is a good idea

> Or do you mean to use a kref that is taken by RCU readers together with
> rcu_read_lock() and dropped at rcu_read_unlock() such that during the
> RCU read critical sections the refcount can't fall below 1 and the
> domain is actually freed once we have a) put the initial reference
> during s390_domain_free() and b) put all temporary references on
> exiting the RCU read critical sections?

Yes, this is a common pattern. Usually you want to optimize away the
global lock that protects, say, a linked list and then accept a local
lock/refcount inside the object
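
As a rough sketch of that pattern (nothing s390 specific, all names made
up):

	rcu_read_lock();
	list_for_each_entry_rcu(obj, &global_list, entry) {
		/* skip objects that are already on their way out */
		if (!kref_get_unless_zero(&obj->ref))
			continue;
		/* ... work on obj under its own lock/refcount ... */
		kref_put(&obj->ref, obj_release);
	}
	rcu_read_unlock();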

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state
  2022-10-18 14:51 ` [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state Niklas Schnelle
@ 2022-10-28 15:55   ` Matthew Rosato
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Rosato @ 2022-10-28 15:55 UTC (permalink / raw)
  To: Niklas Schnelle, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On 10/18/22 10:51 AM, Niklas Schnelle wrote:
> If a zPCI device is in the error state while switching IOMMU domains
> zpci_register_ioat() will fail and we would end up with the device not
> attached to any domain. In this state since zdev->dma_table == NULL
> a reset via zpci_hot_reset_device() would wrongfully re-initialize the
> device for DMA API usage using zpci_dma_init_device(). As automatic
> recovery is currently disabled while attached to an IOMMU domain this
> only affects slot resets triggered through other means but will affect
> automatic recovery once we switch to using dma-iommu.
> 
> Additionally with that switch common code expects attaching to the
> default domain to always work so zpci_register_ioat() should only fail
> if there is no chance to recover anyway, e.g. if the device has been
> unplugged.
> 
> Improve the robustness of attach by specifically looking at the status
> returned by zpci_mod_fc() to determine if the device is unavailable and
> in this case simply ignore the error. Once the device is reset
> zpci_hot_reset_device() will then correctly set the domain's DMA
> translation tables.
> 
> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>

Seems reasonable to me.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

> ---
>  arch/s390/include/asm/pci.h |  2 +-
>  arch/s390/kvm/pci.c         |  6 ++++--
>  arch/s390/pci/pci.c         | 11 ++++++-----
>  arch/s390/pci/pci_dma.c     |  3 ++-
>  drivers/iommu/s390-iommu.c  |  9 +++++++--
>  5 files changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 15f8714ca9b7..07361e2fd8c5 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -221,7 +221,7 @@ void zpci_device_reserved(struct zpci_dev *zdev);
>  bool zpci_is_device_configured(struct zpci_dev *zdev);
>  
>  int zpci_hot_reset_device(struct zpci_dev *zdev);
> -int zpci_register_ioat(struct zpci_dev *, u8, u64, u64, u64);
> +int zpci_register_ioat(struct zpci_dev *, u8, u64, u64, u64, u8 *);
>  int zpci_unregister_ioat(struct zpci_dev *, u8);
>  void zpci_remove_reserved_devices(void);
>  void zpci_update_fh(struct zpci_dev *zdev, u32 fh);
> diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
> index c50c1645c0ae..03964c0e1fdf 100644
> --- a/arch/s390/kvm/pci.c
> +++ b/arch/s390/kvm/pci.c
> @@ -434,6 +434,7 @@ static void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
>  static int kvm_s390_pci_register_kvm(void *opaque, struct kvm *kvm)
>  {
>  	struct zpci_dev *zdev = opaque;
> +	u8 status;
>  	int rc;
>  
>  	if (!zdev)
> @@ -486,7 +487,7 @@ static int kvm_s390_pci_register_kvm(void *opaque, struct kvm *kvm)
>  
>  	/* Re-register the IOMMU that was already created */
>  	rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> -				virt_to_phys(zdev->dma_table));
> +				virt_to_phys(zdev->dma_table), &status);
>  	if (rc)
>  		goto clear_gisa;
>  
> @@ -516,6 +517,7 @@ static void kvm_s390_pci_unregister_kvm(void *opaque)
>  {
>  	struct zpci_dev *zdev = opaque;
>  	struct kvm *kvm;
> +	u8 status;
>  
>  	if (!zdev)
>  		return;
> @@ -554,7 +556,7 @@ static void kvm_s390_pci_unregister_kvm(void *opaque)
>  
>  	/* Re-register the IOMMU that was already created */
>  	zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> -			   virt_to_phys(zdev->dma_table));
> +			   virt_to_phys(zdev->dma_table), &status);
>  
>  out:
>  	spin_lock(&kvm->arch.kzdev_list_lock);
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 73cdc5539384..a703dcd94a68 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -116,20 +116,20 @@ EXPORT_SYMBOL_GPL(pci_proc_domain);
>  
>  /* Modify PCI: Register I/O address translation parameters */
>  int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
> -		       u64 base, u64 limit, u64 iota)
> +		       u64 base, u64 limit, u64 iota, u8 *status)
>  {
>  	u64 req = ZPCI_CREATE_REQ(zdev->fh, dmaas, ZPCI_MOD_FC_REG_IOAT);
>  	struct zpci_fib fib = {0};
> -	u8 cc, status;
> +	u8 cc;
>  
>  	WARN_ON_ONCE(iota & 0x3fff);
>  	fib.pba = base;
>  	fib.pal = limit;
>  	fib.iota = iota | ZPCI_IOTA_RTTO_FLAG;
>  	fib.gd = zdev->gisa;
> -	cc = zpci_mod_fc(req, &fib, &status);
> +	cc = zpci_mod_fc(req, &fib, status);
>  	if (cc)
> -		zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
> +		zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, *status);
>  	return cc;
>  }
>  EXPORT_SYMBOL_GPL(zpci_register_ioat);
> @@ -764,6 +764,7 @@ EXPORT_SYMBOL_GPL(zpci_disable_device);
>   */
>  int zpci_hot_reset_device(struct zpci_dev *zdev)
>  {
> +	u8 status;
>  	int rc;
>  
>  	zpci_dbg(3, "rst fid:%x, fh:%x\n", zdev->fid, zdev->fh);
> @@ -787,7 +788,7 @@ int zpci_hot_reset_device(struct zpci_dev *zdev)
>  
>  	if (zdev->dma_table)
>  		rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> -					virt_to_phys(zdev->dma_table));
> +					virt_to_phys(zdev->dma_table), &status);
>  	else
>  		rc = zpci_dma_init_device(zdev);
>  	if (rc) {
> diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
> index 227cf0a62800..dee825ee7305 100644
> --- a/arch/s390/pci/pci_dma.c
> +++ b/arch/s390/pci/pci_dma.c
> @@ -547,6 +547,7 @@ static void s390_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>  	
>  int zpci_dma_init_device(struct zpci_dev *zdev)
>  {
> +	u8 status;
>  	int rc;
>  
>  	/*
> @@ -598,7 +599,7 @@ int zpci_dma_init_device(struct zpci_dev *zdev)
>  
>  	}
>  	if (zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> -			       virt_to_phys(zdev->dma_table))) {
> +			       virt_to_phys(zdev->dma_table), &status)) {
>  		rc = -EIO;
>  		goto free_bitmap;
>  	}
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 6c407b61b25a..ee88e717254b 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -98,6 +98,7 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
>  	struct s390_domain *s390_domain = to_s390_domain(domain);
>  	struct zpci_dev *zdev = to_zpci_dev(dev);
>  	unsigned long flags;
> +	u8 status;
>  	int cc;
>  
>  	if (!zdev)
> @@ -113,8 +114,12 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
>  		zpci_dma_exit_device(zdev);
>  
>  	cc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> -				virt_to_phys(s390_domain->dma_table));
> -	if (cc)
> +				virt_to_phys(s390_domain->dma_table), &status);
> +	/*
> +	 * If the device is undergoing error recovery the reset code
> +	 * will re-establish the new domain.
> +	 */
> +	if (cc && status != ZPCI_PCI_ST_FUNC_NOT_AVAIL)
>  		return -EIO;
>  	zdev->dma_table = s390_domain->dma_table;
>  


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] iommu/s390: Add I/O TLB ops
  2022-10-18 14:51 ` [PATCH 2/5] iommu/s390: Add I/O TLB ops Niklas Schnelle
@ 2022-10-28 16:03   ` Matthew Rosato
  2022-10-31 16:11     ` Robin Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Rosato @ 2022-10-28 16:03 UTC (permalink / raw)
  To: Niklas Schnelle, iommu, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On 10/18/22 10:51 AM, Niklas Schnelle wrote:
> Currently s390-iommu does an I/O TLB flush (RPCIT) for every update of
> the I/O translation table explicitly. For one this is wasteful since
> RPCIT can be skipped after a mapping operation if zdev->tlb_refresh is
> unset. Moreover we can do a single RPCIT for a range of pages including
> when doing lazy unmapping.
> 
> Thankfully both of these optimizations can be achieved by implementing
> the IOMMU operations common code provides for the different types of I/O
> tlb flushes:
> 
>  * flush_iotlb_all: Flushes the I/O TLB for the entire IOVA space
>  * iotlb_sync:  Flushes the I/O TLB for a range of pages that can be
>    gathered up, for example to implement lazy unmapping.
>  * iotlb_sync_map: Flushes the I/O TLB after a mapping operation
> 
> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> ---
>  drivers/iommu/s390-iommu.c | 76 ++++++++++++++++++++++++++++++++------
>  1 file changed, 65 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index ee88e717254b..a4c2e9bc6d83 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -199,14 +199,72 @@ static void s390_iommu_release_device(struct device *dev)
>  		__s390_iommu_detach_device(zdev);
>  }
>  
> +static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +	struct s390_domain *s390_domain = to_s390_domain(domain);
> +	struct zpci_dev *zdev;
> +	unsigned long flags;
> +	int rc;
> +
> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> +		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
> +					zdev->end_dma - zdev->start_dma + 1);
> +		if (rc)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> +}
> +
> +static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
> +				  struct iommu_iotlb_gather *gather)
> +{
> +	struct s390_domain *s390_domain = to_s390_domain(domain);
> +	size_t size = gather->end - gather->start + 1;
> +	struct zpci_dev *zdev;
> +	unsigned long flags;
> +	int rc;
> +
> +	/* If gather was never added to there is nothing to flush */
> +	if (gather->start == ULONG_MAX)
> +		return;

Hmm, this seems a little awkward in that it depends on the init value in iommu_iotlb_gather_init never changing.  I don't see any other iommu drivers doing this -- Is there no other way to tell there's nothing to flush?

If we really need to do this, maybe some shared #define in iommu.h that is used in iommu_iotlb_gather_init and here?

> +
> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> +		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
> +					size);
> +		if (rc)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> +}
> +
> +static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
> +				      unsigned long iova, size_t size)
> +{
> +	struct s390_domain *s390_domain = to_s390_domain(domain);
> +	struct zpci_dev *zdev;
> +	unsigned long flags;
> +	int rc;
> +
> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> +		if (!zdev->tlb_refresh)
> +			continue;
> +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> +					iova, size);
> +		if (rc)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> +}
> +
>  static int s390_iommu_update_trans(struct s390_domain *s390_domain,
>  				   phys_addr_t pa, dma_addr_t dma_addr,
>  				   unsigned long nr_pages, int flags)
>  {
>  	phys_addr_t page_addr = pa & PAGE_MASK;
> -	dma_addr_t start_dma_addr = dma_addr;
>  	unsigned long irq_flags, i;
> -	struct zpci_dev *zdev;
>  	unsigned long *entry;
>  	int rc = 0;
>  
> @@ -225,15 +283,6 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
>  		dma_addr += PAGE_SIZE;
>  	}
>  
> -	spin_lock(&s390_domain->list_lock);
> -	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> -		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> -					start_dma_addr, nr_pages * PAGE_SIZE);
> -		if (rc)
> -			break;
> -	}
> -	spin_unlock(&s390_domain->list_lock);
> -
>  undo_cpu_trans:
>  	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
>  		flags = ZPCI_PTE_INVALID;
> @@ -340,6 +389,8 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
>  	if (rc)
>  		return 0;
>  
> +	iommu_iotlb_gather_add_range(gather, iova, size);
> +
>  	return size;
>  }
>  
> @@ -384,6 +435,9 @@ static const struct iommu_ops s390_iommu_ops = {
>  		.detach_dev	= s390_iommu_detach_device,
>  		.map_pages	= s390_iommu_map_pages,
>  		.unmap_pages	= s390_iommu_unmap_pages,
> +		.flush_iotlb_all = s390_iommu_flush_iotlb_all,
> +		.iotlb_sync      = s390_iommu_iotlb_sync,
> +		.iotlb_sync_map  = s390_iommu_iotlb_sync_map,
>  		.iova_to_phys	= s390_iommu_iova_to_phys,
>  		.free		= s390_domain_free,
>  	}


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] iommu/s390: Add I/O TLB ops
  2022-10-28 16:03   ` Matthew Rosato
@ 2022-10-31 16:11     ` Robin Murphy
  2022-11-02 10:51       ` Niklas Schnelle
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2022-10-31 16:11 UTC (permalink / raw)
  To: Matthew Rosato, Niklas Schnelle, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On 2022-10-28 17:03, Matthew Rosato wrote:
> On 10/18/22 10:51 AM, Niklas Schnelle wrote:
>> Currently s390-iommu does an I/O TLB flush (RPCIT) for every update of
>> the I/O translation table explicitly. For one this is wasteful since
>> RPCIT can be skipped after a mapping operation if zdev->tlb_refresh is
>> unset. Moreover we can do a single RPCIT for a range of pages including
>> when doing lazy unmapping.
>>
>> Thankfully both of these optimizations can be achieved by implementing
>> the IOMMU operations common code provides for the different types of I/O
>> tlb flushes:
>>
>>   * flush_iotlb_all: Flushes the I/O TLB for the entire IOVA space
>>   * iotlb_sync:  Flushes the I/O TLB for a range of pages that can be
>>     gathered up, for example to implement lazy unmapping.
>>   * iotlb_sync_map: Flushes the I/O TLB after a mapping operation
>>
>> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
>> ---
>>   drivers/iommu/s390-iommu.c | 76 ++++++++++++++++++++++++++++++++------
>>   1 file changed, 65 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
>> index ee88e717254b..a4c2e9bc6d83 100644
>> --- a/drivers/iommu/s390-iommu.c
>> +++ b/drivers/iommu/s390-iommu.c
>> @@ -199,14 +199,72 @@ static void s390_iommu_release_device(struct device *dev)
>>   		__s390_iommu_detach_device(zdev);
>>   }
>>   
>> +static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
>> +{
>> +	struct s390_domain *s390_domain = to_s390_domain(domain);
>> +	struct zpci_dev *zdev;
>> +	unsigned long flags;
>> +	int rc;
>> +
>> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
>> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
>> +		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
>> +					zdev->end_dma - zdev->start_dma + 1);
>> +		if (rc)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
>> +}
>> +
>> +static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
>> +				  struct iommu_iotlb_gather *gather)
>> +{
>> +	struct s390_domain *s390_domain = to_s390_domain(domain);
>> +	size_t size = gather->end - gather->start + 1;
>> +	struct zpci_dev *zdev;
>> +	unsigned long flags;
>> +	int rc;
>> +
>> +	/* If gather was never added to there is nothing to flush */
>> +	if (gather->start == ULONG_MAX)
>> +		return;
> 
> Hmm, this seems a little awkward in that it depends on the init value in iommu_iotlb_gather_init never changing.  I don't see any other iommu drivers doing this -- Is there no other way to tell there's nothing to flush?
> 
> If we really need to do this, maybe some shared #define in iommu.h that is used in iommu_iotlb_gather_init and here?

If you can trust yourselves to never gather a single byte (which by 
construction should be impossible), "!gather->end" is perhaps a tiny bit 
more robust (and consistent with iommu_iotlb_gather_is_disjoint()), 
although given the way that iommu_iotlb_gather_add_*() work I don't 
think either initial value has much chance of changing in practice, 
short of some larger refactoring that would likely have to touch all the 
users anyway. If you still want to be as foolproof as possible, using 
"gather->start > gather->end" would represent the most general form of 
the initial conditions.

FWIW, SMMUv3 does also check for an empty range, but using 
gather->pgsize that is only relevant with add_page(). The other gather 
users seem happy to go ahead and just issue whatever wacky invalidation 
command those initial values end up looking like. I think an empty sync 
should really only happen in unexpected conditions like an unmap 
failing, so it shouldn't be a case that deserves a great deal of 
optimisation effort.

Thanks,
Robin.

>> +
>> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
>> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
>> +		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
>> +					size);
>> +		if (rc)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
>> +}
>> +
>> +static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
>> +				      unsigned long iova, size_t size)
>> +{
>> +	struct s390_domain *s390_domain = to_s390_domain(domain);
>> +	struct zpci_dev *zdev;
>> +	unsigned long flags;
>> +	int rc;
>> +
>> +	spin_lock_irqsave(&s390_domain->list_lock, flags);
>> +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
>> +		if (!zdev->tlb_refresh)
>> +			continue;
>> +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
>> +					iova, size);
>> +		if (rc)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
>> +}
>> +
>>   static int s390_iommu_update_trans(struct s390_domain *s390_domain,
>>   				   phys_addr_t pa, dma_addr_t dma_addr,
>>   				   unsigned long nr_pages, int flags)
>>   {
>>   	phys_addr_t page_addr = pa & PAGE_MASK;
>> -	dma_addr_t start_dma_addr = dma_addr;
>>   	unsigned long irq_flags, i;
>> -	struct zpci_dev *zdev;
>>   	unsigned long *entry;
>>   	int rc = 0;
>>   
>> @@ -225,15 +283,6 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
>>   		dma_addr += PAGE_SIZE;
>>   	}
>>   
>> -	spin_lock(&s390_domain->list_lock);
>> -	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
>> -		rc = zpci_refresh_trans((u64)zdev->fh << 32,
>> -					start_dma_addr, nr_pages * PAGE_SIZE);
>> -		if (rc)
>> -			break;
>> -	}
>> -	spin_unlock(&s390_domain->list_lock);
>> -
>>   undo_cpu_trans:
>>   	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
>>   		flags = ZPCI_PTE_INVALID;
>> @@ -340,6 +389,8 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
>>   	if (rc)
>>   		return 0;
>>   
>> +	iommu_iotlb_gather_add_range(gather, iova, size);
>> +
>>   	return size;
>>   }
>>   
>> @@ -384,6 +435,9 @@ static const struct iommu_ops s390_iommu_ops = {
>>   		.detach_dev	= s390_iommu_detach_device,
>>   		.map_pages	= s390_iommu_map_pages,
>>   		.unmap_pages	= s390_iommu_unmap_pages,
>> +		.flush_iotlb_all = s390_iommu_flush_iotlb_all,
>> +		.iotlb_sync      = s390_iommu_iotlb_sync,
>> +		.iotlb_sync_map  = s390_iommu_iotlb_sync_map,
>>   		.iova_to_phys	= s390_iommu_iova_to_phys,
>>   		.free		= s390_domain_free,
>>   	}
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] iommu/s390: Add I/O TLB ops
  2022-10-31 16:11     ` Robin Murphy
@ 2022-11-02 10:51       ` Niklas Schnelle
  0 siblings, 0 replies; 28+ messages in thread
From: Niklas Schnelle @ 2022-11-02 10:51 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, iommu, Joerg Roedel, Will Deacon,
	Jason Gunthorpe
  Cc: Gerd Bayer, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel

On Mon, 2022-10-31 at 16:11 +0000, Robin Murphy wrote:
> On 2022-10-28 17:03, Matthew Rosato wrote:
> > On 10/18/22 10:51 AM, Niklas Schnelle wrote:
> > > Currently s390-iommu does an I/O TLB flush (RPCIT) for every update of
> > > the I/O translation table explicitly. For one this is wasteful since
> > > RPCIT can be skipped after a mapping operation if zdev->tlb_refresh is
> > > unset. Moreover we can do a single RPCIT for a range of pages including
> > > when doing lazy unmapping.
> > > 
> > > Thankfully both of these optimizations can be achieved by implementing
> > > the IOMMU operations common code provides for the different types of I/O
> > > tlb flushes:
> > > 
> > >   * flush_iotlb_all: Flushes the I/O TLB for the entire IOVA space
> > >   * iotlb_sync:  Flushes the I/O TLB for a range of pages that can be
> > >     gathered up, for example to implement lazy unmapping.
> > >   * iotlb_sync_map: Flushes the I/O TLB after a mapping operation
> > > 
> > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > ---
> > >   drivers/iommu/s390-iommu.c | 76 ++++++++++++++++++++++++++++++++------
> > >   1 file changed, 65 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> > > index ee88e717254b..a4c2e9bc6d83 100644
> > > --- a/drivers/iommu/s390-iommu.c
> > > +++ b/drivers/iommu/s390-iommu.c
> > > @@ -199,14 +199,72 @@ static void s390_iommu_release_device(struct device *dev)
> > >   		__s390_iommu_detach_device(zdev);
> > >   }
> > >   
> > > +static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > > +{
> > > +	struct s390_domain *s390_domain = to_s390_domain(domain);
> > > +	struct zpci_dev *zdev;
> > > +	unsigned long flags;
> > > +	int rc;
> > > +
> > > +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> > > +		rc = zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
> > > +					zdev->end_dma - zdev->start_dma + 1);
> > > +		if (rc)
> > > +			break;
> > > +	}
> > > +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > > +}
> > > +
> > > +static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
> > > +				  struct iommu_iotlb_gather *gather)
> > > +{
> > > +	struct s390_domain *s390_domain = to_s390_domain(domain);
> > > +	size_t size = gather->end - gather->start + 1;
> > > +	struct zpci_dev *zdev;
> > > +	unsigned long flags;
> > > +	int rc;
> > > +
> > > +	/* If gather was never added to there is nothing to flush */
> > > +	if (gather->start == ULONG_MAX)
> > > +		return;
> > 
> > Hmm, this seems a little awkward in that it depends on the init value in iommu_iotlb_gather_init never changing.  I don't see any other iommu drivers doing this -- Is there no other way to tell there's nothing to flush?
> > 
> > If we really need to do this, maybe some shared #define in iommu.h that is used in iommu_iotlb_gather_init and here?
> 
> If you can trust yourselves to never gather a single byte (which by 
> construction should be impossible), "!gather->end" is perhaps a tiny bit 
> more robust (and consistent with iommu_iotlb_gather_is_disjoint()), 
> although given the way that iommu_iotlb_gather_add_*() work I don't 
> think either initial value has much chance of changing in practice, 
> short of some larger refactoring that would likely have to touch all the 
> users anyway. If you still want to be as foolproof as possible, using 
> "gather->start > gather->end" would represent the most general form of 
> the initial conditions.
> 
> FWIW, SMMUv3 does also check for an empty range, but using 
> gather->pgsize that is only relevant with add_page(). The other gather 
> users seem happy to go ahead and just issue whatever wacky invalidation 
> command those initial values end up looking like. I think an empty sync 
> should really only happen in unexpected conditions like an unmap 
> failing, so it shouldn't be a case that deserves a great deal of 
> optimisation effort.
> 
> Thanks,
> Robin.
> 

Yeah I agree this should only happen when unmap failed. I think I added
this when I was playing around with adding an intermediate flush
similar to what amd_iommu_iotlb_gather_add_page() does, only that in
some intermediate stages I could end up with nothing left to flush.
That whole optimization did turn out not to help and I removed it
again. I think even if it's only for the error case now, I'd like to
keep it though. This makes sure we don't get weirdly sized flushes in
the error case. I'll use '!gather->end' to be consistent with
iommu_iotlb_gather_is_disjoint() as you suggested.
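
I.e. the check then simply becomes:

	/* If gather was never added to, there is nothing to flush */
	if (!gather->end)
		return;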

Speaking of that AMD optimization, I'm actually not sure that it does
the right thing for AMD either. The way I read the code, it does more,
but only contiguous, TLB flushes in virtualized mode and at least for us
this turned out to be detrimental. Also the comment, at least to me, makes it
sound as if they were trying for fewer flushes but it's worded a bit
confusingly so not sure.

Thanks,
Niklas

> > > +
> > > +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> > > +		rc = zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
> > > +					size);
> > > +		if (rc)
> > > +			break;
> > > +	}
> > > +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > > +}
> > > +
> > > +static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
> > > +				      unsigned long iova, size_t size)
> > > +{
> > > +	struct s390_domain *s390_domain = to_s390_domain(domain);
> > > +	struct zpci_dev *zdev;
> > > +	unsigned long flags;
> > > +	int rc;
> > > +
> > > +	spin_lock_irqsave(&s390_domain->list_lock, flags);
> > > +	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> > > +		if (!zdev->tlb_refresh)
> > > +			continue;
> > > +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> > > +					iova, size);
> > > +		if (rc)
> > > +			break;
> > > +	}
> > > +	spin_unlock_irqrestore(&s390_domain->list_lock, flags);
> > > +}
> > > +
> > >   static int s390_iommu_update_trans(struct s390_domain *s390_domain,
> > >   				   phys_addr_t pa, dma_addr_t dma_addr,
> > >   				   unsigned long nr_pages, int flags)
> > >   {
> > >   	phys_addr_t page_addr = pa & PAGE_MASK;
> > > -	dma_addr_t start_dma_addr = dma_addr;
> > >   	unsigned long irq_flags, i;
> > > -	struct zpci_dev *zdev;
> > >   	unsigned long *entry;
> > >   	int rc = 0;
> > >   
> > > @@ -225,15 +283,6 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
> > >   		dma_addr += PAGE_SIZE;
> > >   	}
> > >   
> > > -	spin_lock(&s390_domain->list_lock);
> > > -	list_for_each_entry(zdev, &s390_domain->devices, iommu_list) {
> > > -		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> > > -					start_dma_addr, nr_pages * PAGE_SIZE);
> > > -		if (rc)
> > > -			break;
> > > -	}
> > > -	spin_unlock(&s390_domain->list_lock);
> > > -
> > >   undo_cpu_trans:
> > >   	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
> > >   		flags = ZPCI_PTE_INVALID;
> > > @@ -340,6 +389,8 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
> > >   	if (rc)
> > >   		return 0;
> > >   
> > > +	iommu_iotlb_gather_add_range(gather, iova, size);
> > > +
> > >   	return size;
> > >   }
> > >   
> > > @@ -384,6 +435,9 @@ static const struct iommu_ops s390_iommu_ops = {
> > >   		.detach_dev	= s390_iommu_detach_device,
> > >   		.map_pages	= s390_iommu_map_pages,
> > >   		.unmap_pages	= s390_iommu_unmap_pages,
> > > +		.flush_iotlb_all = s390_iommu_flush_iotlb_all,
> > > +		.iotlb_sync      = s390_iommu_iotlb_sync,
> > > +		.iotlb_sync_map  = s390_iommu_iotlb_sync_map,
> > >   		.iova_to_phys	= s390_iommu_iova_to_phys,
> > >   		.free		= s390_domain_free,
> > >   	}



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-11-02 10:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-18 14:51 [PATCH 0/5] iommu/s390: Further improvements Niklas Schnelle
2022-10-18 14:51 ` [PATCH 1/5] iommu/s390: Make attach succeed even if the device is in error state Niklas Schnelle
2022-10-28 15:55   ` Matthew Rosato
2022-10-18 14:51 ` [PATCH 2/5] iommu/s390: Add I/O TLB ops Niklas Schnelle
2022-10-28 16:03   ` Matthew Rosato
2022-10-31 16:11     ` Robin Murphy
2022-11-02 10:51       ` Niklas Schnelle
2022-10-18 14:51 ` [PATCH 3/5] iommu/s390: Use RCU to allow concurrent domain_list iteration Niklas Schnelle
2022-10-18 15:18   ` Jason Gunthorpe
2022-10-19  8:31     ` Niklas Schnelle
2022-10-19 11:53       ` Jason Gunthorpe
2022-10-20  8:51         ` Niklas Schnelle
2022-10-20 11:05           ` Jason Gunthorpe
2022-10-21 12:08             ` Niklas Schnelle
2022-10-21 13:36               ` Jason Gunthorpe
2022-10-21 15:01                 ` Niklas Schnelle
2022-10-21 15:04                   ` Jason Gunthorpe
2022-10-24 15:22                     ` Niklas Schnelle
2022-10-24 16:26                       ` Jason Gunthorpe
2022-10-27 12:44                         ` Niklas Schnelle
2022-10-27 12:56                           ` Jason Gunthorpe
2022-10-27 13:35                             ` Niklas Schnelle
2022-10-27 14:03                               ` Jason Gunthorpe
2022-10-28  9:29                                 ` Niklas Schnelle
2022-10-28 11:28                                   ` Jason Gunthorpe
2022-10-21 15:05                   ` Niklas Schnelle
2022-10-18 14:51 ` [PATCH 4/5] iommu/s390: Optimize IOMMU table walking Niklas Schnelle
2022-10-18 14:51 ` [PATCH 5/5] s390/pci: use lock-free I/O translation updates Niklas Schnelle
