linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing
@ 2022-11-16 17:16 Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 1/7] s390/ism: Set DMA coherent mask Niklas Schnelle
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

Hi All,

This patch series converts s390's PCI support from its platform-specific DMA
API implementation in arch/s390/pci/pci_dma.c to the common DMA IOMMU layer.
The conversion itself is done in patch 3. Applied on top of my s390 IOMMU
improvements series[0], it only touches the s390 IOMMU driver and arch code,
moving the remaining functions over from the s390 DMA API implementation; no
changes to common code are necessary for the conversion itself. Patch 4 does,
however, propose an additional common code change to let the iommu.strict
kernel parameter override ops->def_domain_type.

After patch 3 the basic conversion is done, and on our partitioning machine
hypervisor (LPAR) performance matches or exceeds that of the existing code.
When running under z/VM or KVM, however, performance plummets to about half
of the existing code due to a much higher rate of IOTLB flushes for unmapped
pages, and we still need to handle out-of-resource indications (patch 7).
Because the hypervisors use IOTLB flushes to synchronize their shadow tables,
these flushes are very expensive, and minimizing them is key to recovering
the lost performance.

To this end, patches 5-7 propose a new single-queue IOTLB flushing scheme as
an alternative to the existing per-CPU flush queues. Introducing an
alternative scheme was also suggested by Robin Murphy[1]. In the previous RFC
of this conversion Robin suggested reusing more of the existing queuing
logic, which I have incorporated into this version. The single-queue mode is
introduced in patch 5. It allows batching a much larger number of lazily
freed IOVAs and was chosen because hypervisors tend to serialize IOTLB
flushes, which removes much of the benefit of multiple queues. Except for
going from per-CPU queues to one global queue the queue logic remains
untouched; a rough sketch of the idea follows below.
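
To make the idea concrete, here is a rough sketch of what a single,
lock-protected flush queue looks like conceptually. All identifiers below are
invented for this cover letter and do not match the actual code in patch 5:

#include <linux/spinlock.h>
#include <linux/types.h>

#define FQ_SIZE 256	/* arbitrary size, just for this sketch */

struct fq_entry {
	unsigned long iova_pfn;	/* start of the lazily freed IOVA range */
	unsigned long pages;	/* size of the range in pages */
	u64 counter;		/* flush generation when the entry was queued */
};

struct single_flush_queue {
	spinlock_t lock;	/* one lock shared by all CPUs, no per-CPU queues */
	u64 counter;		/* incremented on every IOTLB flush */
	unsigned int head, tail;
	struct fq_entry entries[FQ_SIZE];
};

static void fq_ring_add(struct single_flush_queue *fq,
			unsigned long iova_pfn, unsigned long pages)
{
	unsigned long flags;

	spin_lock_irqsave(&fq->lock, flags);
	/*
	 * Queue the freed range (full-ring handling omitted); the IOVA is
	 * only handed back to the allocator once a later, batched IOTLB
	 * flush has completed.
	 */
	fq->entries[fq->tail] = (struct fq_entry) {
		.iova_pfn = iova_pfn,
		.pages	  = pages,
		.counter  = fq->counter,
	};
	fq->tail = (fq->tail + 1) % FQ_SIZE;
	spin_unlock_irqrestore(&fq->lock, flags);
}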

Patch 6 then enables variable queue sizes, using power-of-2 sizes and
shift/mask indexing to keep performance as close to the existing code as
possible (see the sketch below). Finally, patch 7 allows triggering a queue
flush from the IOMMU driver in order to handle the out-of-resource indication
that s390 signals on an IOTLB flush.
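
The shift/mask indexing is the usual power-of-two ring trick; sketched below
with made-up names (the real field names in patch 6 may differ), the modulo
from the sketch above becomes a cheap bitwise AND:

/*
 * Illustrative only: with a power-of-two queue size, ring index wrap-around
 * needs no division, just a mask derived from the size.
 */
struct fq_ring {
	unsigned int order;	/* the queue holds 1 << order entries */
	unsigned int mask;	/* (1 << order) - 1 */
	unsigned int head, tail;
};

static inline void fq_ring_init(struct fq_ring *ring, unsigned int order)
{
	ring->order = order;
	ring->mask = (1U << order) - 1;
	ring->head = ring->tail = 0;
}

static inline unsigned int fq_ring_next(const struct fq_ring *ring,
					unsigned int idx)
{
	return (idx + 1) & ring->mask;	/* instead of (idx + 1) % queue size */
}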

As it is implemented in common code, the single-queue IOTLB flushing scheme
can of course be used by other platforms with expensive IOTLB flushes;
virtio-iommu in particular might be a candidate. With this series, however,
only s390 systems that require IOTLB flushes on map default to it, while LPAR
keeps using the per-CPU queues.

I did verify that the new scheme works on my x86_64 Ryzen workstation by
locally modifying drivers/iommu/iommu.c:iommu_subsys_init() to default to the
single-queue mode and confirming its use via "/sys/.../iommu_group/type".
I found no problems with an AMD GPU, an Intel NIC (with SR-IOV), NVMes, or
any on-board peripherals, though the only performance test was a round of
CS:GO :-)

As with previous series, this is available via my git.kernel.org tree[3] in
the dma_iommu_v2 branch with the s390_dma_iommu_v2 tag.

NOTE: Due to the large drop in performance and the issue of out-of-resource
handling, we can't merge the DMA API conversion (patches 1-3) until we have
a better-suited IOVA flushing scheme with improvements similar to those
proposed in patches 5-7.

Best regards,
Niklas

[0] https://lore.kernel.org/linux-iommu/20221109142903.4080275-1-schnelle@linux.ibm.com/
[1] https://lore.kernel.org/linux-iommu/3e402947-61f9-b7e8-1414-fde006257b6f@arm.com/
[2] https://lore.kernel.org/linux-iommu/a8e778da-7b41-a6ba-83c3-c366a426c3da@arm.com/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/

Changes since RFC v1:
- Patch 1 uses dma_set_mask_and_coherent() (Christoph)
- Patch 3 now documents and allows the use of iommu.strict=0|1 on s390 and
  deprecates s390_iommu=strict while making it an alias.
- Patches 5-7 completely reworked to reuse existing queue logic (Robin)
- Added patch 4 to allow using iommu.strict=0|1 to override
  ops->def_domain_type.

Niklas Schnelle (7):
  s390/ism: Set DMA coherent mask
  s390/pci: prepare is_passed_through() for dma-iommu
  s390/pci: Use dma-iommu layer
  iommu: Let iommu.strict override ops->def_domain_type
  iommu/dma: Allow a single FQ in addition to per-CPU FQs
  iommu/dma: Enable variable queue size and use larger single queue
  iommu/s390: flush queued IOVAs on RPCIT out of resource indication

 .../admin-guide/kernel-parameters.txt         |   9 +-
 arch/s390/include/asm/pci.h                   |   7 -
 arch/s390/include/asm/pci_dma.h               | 120 +--
 arch/s390/pci/Makefile                        |   2 +-
 arch/s390/pci/pci.c                           |  22 +-
 arch/s390/pci/pci_bus.c                       |   5 -
 arch/s390/pci/pci_debug.c                     |  13 +-
 arch/s390/pci/pci_dma.c                       | 732 ------------------
 arch/s390/pci/pci_event.c                     |  17 +-
 arch/s390/pci/pci_sysfs.c                     |  19 +-
 drivers/iommu/Kconfig                         |   3 +-
 drivers/iommu/dma-iommu.c                     | 188 ++++-
 drivers/iommu/dma-iommu.h                     |   1 +
 drivers/iommu/iommu.c                         |  18 +-
 drivers/iommu/s390-iommu.c                    | 409 +++++++++-
 drivers/s390/net/ism_drv.c                    |   2 +-
 include/linux/iommu.h                         |   7 +
 17 files changed, 603 insertions(+), 971 deletions(-)
 delete mode 100644 arch/s390/pci/pci_dma.c

-- 
2.34.1



* [PATCH v2 1/7] s390/ism: Set DMA coherent mask
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 2/7] s390/pci: prepare is_passed_through() for dma-iommu Niklas Schnelle
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

A future change will convert the DMA API implementation from the
architecture-specific arch/s390/pci/pci_dma.c to the common code in
drivers/iommu/dma-iommu.c, which utilizes the same IOMMU hardware through the
s390-iommu driver. Unlike the s390-specific DMA API, this requires devices to
correctly set the coherent mask to be allowed to use IOVAs >2^32 in
dma_alloc_coherent(). This was, however, not done for ISM devices. ISM
requires such addresses, since currently the DMA aperture for PCI devices
starts at 2^32, so without the coherent mask all of its calls to
dma_alloc_coherent() would fail.
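
For reference, dma_set_mask_and_coherent() is essentially shorthand for
setting both the streaming and the coherent mask. Roughly, and simplified
compared to the actual helper in include/linux/dma-mapping.h:

int dma_set_mask_and_coherent(struct device *dev, u64 mask)
{
	/*
	 * Set the streaming DMA mask and, only if that succeeded, also the
	 * coherent allocation mask used by dma_alloc_coherent().
	 */
	int rc = dma_set_mask(dev, mask);

	if (rc == 0)
		dma_set_coherent_mask(dev, mask);
	return rc;
}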

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
v1 -> v2:
- Use dma_set_mask_and_coherent() (Christoph Hellwig)

 drivers/s390/net/ism_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index d34bb6ec1490..da8a549b5965 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -556,7 +556,7 @@ static int ism_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (ret)
 		goto err_disable;
 
-	ret = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
+	ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
 	if (ret)
 		goto err_resource;
 
-- 
2.34.1



* [PATCH v2 2/7] s390/pci: prepare is_passed_through() for dma-iommu
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 1/7] s390/ism: Set DMA coherent mask Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 3/7] s390/pci: Use dma-iommu layer Niklas Schnelle
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

With the IOMMU always controlled through the IOMMU driver, testing for
zdev->s390_domain is not a valid indication of the device being
passed through. Instead, test whether zdev->kzdev is set.

Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 arch/s390/pci/pci_event.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index b9324ca2eb94..4ef5a6a1d618 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -59,9 +59,16 @@ static inline bool ers_result_indicates_abort(pci_ers_result_t ers_res)
 	}
 }
 
-static bool is_passed_through(struct zpci_dev *zdev)
+static bool is_passed_through(struct pci_dev *pdev)
 {
-	return zdev->s390_domain;
+	struct zpci_dev *zdev = to_zpci(pdev);
+	bool ret;
+
+	mutex_lock(&zdev->kzdev_lock);
+	ret = !!zdev->kzdev;
+	mutex_unlock(&zdev->kzdev_lock);
+
+	return ret;
 }
 
 static bool is_driver_supported(struct pci_driver *driver)
@@ -176,7 +183,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
 	}
 	pdev->error_state = pci_channel_io_frozen;
 
-	if (is_passed_through(to_zpci(pdev))) {
+	if (is_passed_through(pdev)) {
 		pr_info("%s: Cannot be recovered in the host because it is a pass-through device\n",
 			pci_name(pdev));
 		goto out_unlock;
@@ -239,7 +246,7 @@ static void zpci_event_io_failure(struct pci_dev *pdev, pci_channel_state_t es)
 	 * we will inject the error event and let the guest recover the device
 	 * itself.
 	 */
-	if (is_passed_through(to_zpci(pdev)))
+	if (is_passed_through(pdev))
 		goto out;
 	driver = to_pci_driver(pdev->dev.driver);
 	if (driver && driver->err_handler && driver->err_handler->error_detected)
-- 
2.34.1



* [PATCH v2 3/7] s390/pci: Use dma-iommu layer
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 1/7] s390/ism: Set DMA coherent mask Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 2/7] s390/pci: prepare is_passed_through() for dma-iommu Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-28 18:03   ` Robin Murphy
  2022-11-16 17:16 ` [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type Niklas Schnelle
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

While s390 already has a standard IOMMU driver, and previous changes have
added I/O TLB flushing operations, this driver is currently only used for
user-space PCI access such as vfio-pci. For the DMA API, s390 instead
utilizes its own implementation in arch/s390/pci/pci_dma.c, which drives the
same hardware and shares some code, but requires a complex and fragile
hand-over between DMA API and IOMMU API use of a device and, despite the
code sharing, still leads to significant duplication and maintenance effort.
Let's utilize the common code DMA API implementation from
drivers/iommu/dma-iommu.c instead, allowing us to get rid of
arch/s390/pci/pci_dma.c.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 .../admin-guide/kernel-parameters.txt         |   9 +-
 arch/s390/include/asm/pci.h                   |   7 -
 arch/s390/include/asm/pci_dma.h               | 120 +--
 arch/s390/pci/Makefile                        |   2 +-
 arch/s390/pci/pci.c                           |  22 +-
 arch/s390/pci/pci_bus.c                       |   5 -
 arch/s390/pci/pci_debug.c                     |  13 +-
 arch/s390/pci/pci_dma.c                       | 732 ------------------
 arch/s390/pci/pci_event.c                     |   2 -
 arch/s390/pci/pci_sysfs.c                     |  19 +-
 drivers/iommu/Kconfig                         |   3 +-
 drivers/iommu/s390-iommu.c                    | 391 +++++++++-
 12 files changed, 408 insertions(+), 917 deletions(-)
 delete mode 100644 arch/s390/pci/pci_dma.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a465d5242774..633312c4f800 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2154,7 +2154,7 @@
 			  forcing Dual Address Cycle for PCI cards supporting
 			  greater than 32-bit addressing.
 
-	iommu.strict=	[ARM64, X86] Configure TLB invalidation behaviour
+	iommu.strict=	[ARM64, X86, S390] Configure TLB invalidation behaviour
 			Format: { "0" | "1" }
 			0 - Lazy mode.
 			  Request that DMA unmap operations use deferred
@@ -5387,9 +5387,10 @@
 	s390_iommu=	[HW,S390]
 			Set s390 IOTLB flushing mode
 		strict
-			With strict flushing every unmap operation will result in
-			an IOTLB flush. Default is lazy flushing before reuse,
-			which is faster.
+			With strict flushing every unmap operation will result
+			in an IOTLB flush. Default is lazy flushing before
+			reuse, which is faster. Deprecated, equivalent to
+			iommu.strict=1.
 
 	s390_iommu_aperture=	[KNL,S390]
 			Specifies the size of the per device DMA address space
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index b248694e0024..3f74f1cf37df 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -159,13 +159,6 @@ struct zpci_dev {
 	unsigned long	*dma_table;
 	int		tlb_refresh;
 
-	spinlock_t	iommu_bitmap_lock;
-	unsigned long	*iommu_bitmap;
-	unsigned long	*lazy_bitmap;
-	unsigned long	iommu_size;
-	unsigned long	iommu_pages;
-	unsigned int	next_bit;
-
 	struct iommu_device iommu_dev;  /* IOMMU core handle */
 
 	char res_name[16];
diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc5..42d7cc4262ca 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -82,116 +82,16 @@ enum zpci_ioat_dtype {
 #define ZPCI_TABLE_VALID_MASK		0x20
 #define ZPCI_TABLE_PROT_MASK		0x200
 
-static inline unsigned int calc_rtx(dma_addr_t ptr)
-{
-	return ((unsigned long) ptr >> ZPCI_RT_SHIFT) & ZPCI_INDEX_MASK;
-}
-
-static inline unsigned int calc_sx(dma_addr_t ptr)
-{
-	return ((unsigned long) ptr >> ZPCI_ST_SHIFT) & ZPCI_INDEX_MASK;
-}
-
-static inline unsigned int calc_px(dma_addr_t ptr)
-{
-	return ((unsigned long) ptr >> PAGE_SHIFT) & ZPCI_PT_MASK;
-}
-
-static inline void set_pt_pfaa(unsigned long *entry, phys_addr_t pfaa)
-{
-	*entry &= ZPCI_PTE_FLAG_MASK;
-	*entry |= (pfaa & ZPCI_PTE_ADDR_MASK);
-}
-
-static inline void set_rt_sto(unsigned long *entry, phys_addr_t sto)
-{
-	*entry &= ZPCI_RTE_FLAG_MASK;
-	*entry |= (sto & ZPCI_RTE_ADDR_MASK);
-	*entry |= ZPCI_TABLE_TYPE_RTX;
-}
-
-static inline void set_st_pto(unsigned long *entry, phys_addr_t pto)
-{
-	*entry &= ZPCI_STE_FLAG_MASK;
-	*entry |= (pto & ZPCI_STE_ADDR_MASK);
-	*entry |= ZPCI_TABLE_TYPE_SX;
-}
-
-static inline void validate_rt_entry(unsigned long *entry)
-{
-	*entry &= ~ZPCI_TABLE_VALID_MASK;
-	*entry &= ~ZPCI_TABLE_OFFSET_MASK;
-	*entry |= ZPCI_TABLE_VALID;
-	*entry |= ZPCI_TABLE_LEN_RTX;
-}
-
-static inline void validate_st_entry(unsigned long *entry)
-{
-	*entry &= ~ZPCI_TABLE_VALID_MASK;
-	*entry |= ZPCI_TABLE_VALID;
-}
-
-static inline void invalidate_pt_entry(unsigned long *entry)
-{
-	WARN_ON_ONCE((*entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_INVALID);
-	*entry &= ~ZPCI_PTE_VALID_MASK;
-	*entry |= ZPCI_PTE_INVALID;
-}
-
-static inline void validate_pt_entry(unsigned long *entry)
-{
-	WARN_ON_ONCE((*entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID);
-	*entry &= ~ZPCI_PTE_VALID_MASK;
-	*entry |= ZPCI_PTE_VALID;
-}
-
-static inline void entry_set_protected(unsigned long *entry)
-{
-	*entry &= ~ZPCI_TABLE_PROT_MASK;
-	*entry |= ZPCI_TABLE_PROTECTED;
-}
-
-static inline void entry_clr_protected(unsigned long *entry)
-{
-	*entry &= ~ZPCI_TABLE_PROT_MASK;
-	*entry |= ZPCI_TABLE_UNPROTECTED;
-}
-
-static inline int reg_entry_isvalid(unsigned long entry)
-{
-	return (entry & ZPCI_TABLE_VALID_MASK) == ZPCI_TABLE_VALID;
-}
-
-static inline int pt_entry_isvalid(unsigned long entry)
-{
-	return (entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID;
-}
-
-static inline unsigned long *get_rt_sto(unsigned long entry)
-{
-	if ((entry & ZPCI_TABLE_TYPE_MASK) == ZPCI_TABLE_TYPE_RTX)
-		return phys_to_virt(entry & ZPCI_RTE_ADDR_MASK);
-	else
-		return NULL;
-
-}
-
-static inline unsigned long *get_st_pto(unsigned long entry)
-{
-	if ((entry & ZPCI_TABLE_TYPE_MASK) == ZPCI_TABLE_TYPE_SX)
-		return phys_to_virt(entry & ZPCI_STE_ADDR_MASK);
-	else
-		return NULL;
-}
-
-/* Prototypes */
-void dma_free_seg_table(unsigned long);
-unsigned long *dma_alloc_cpu_table(void);
-void dma_cleanup_tables(unsigned long *);
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr);
-void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int flags);
-
-extern const struct dma_map_ops s390_pci_dma_ops;
+struct zpci_iommu_ctrs {
+	atomic64_t		mapped_pages;
+	atomic64_t		unmapped_pages;
+	atomic64_t		global_rpcits;
+	atomic64_t		sync_map_rpcits;
+	atomic64_t		sync_rpcits;
+};
+
+struct zpci_dev;
 
+struct zpci_iommu_ctrs *zpci_get_iommu_ctrs(struct zpci_dev *zdev);
 
 #endif
diff --git a/arch/s390/pci/Makefile b/arch/s390/pci/Makefile
index 5ae31ca9dd44..0547a10406e7 100644
--- a/arch/s390/pci/Makefile
+++ b/arch/s390/pci/Makefile
@@ -3,7 +3,7 @@
 # Makefile for the s390 PCI subsystem.
 #
 
-obj-$(CONFIG_PCI)	+= pci.o pci_irq.o pci_dma.o pci_clp.o pci_sysfs.o \
+obj-$(CONFIG_PCI)	+= pci.o pci_irq.o pci_clp.o pci_sysfs.o \
 			   pci_event.o pci_debug.o pci_insn.o pci_mmio.o \
 			   pci_bus.o pci_kvm_hook.o
 obj-$(CONFIG_PCI_IOV)	+= pci_iov.o
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index ef38b1514c77..6b0fe8761509 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -124,7 +124,11 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
 
 	WARN_ON_ONCE(iota & 0x3fff);
 	fib.pba = base;
-	fib.pal = limit;
+	/* Work around off by one in ISM virt device */
+	if (zdev->pft == 0x5 && limit > base)
+		fib.pal = limit + (1 << 12);
+	else
+		fib.pal = limit;
 	fib.iota = iota | ZPCI_IOTA_RTTO_FLAG;
 	fib.gd = zdev->gisa;
 	cc = zpci_mod_fc(req, &fib, status);
@@ -615,7 +619,6 @@ int pcibios_device_add(struct pci_dev *pdev)
 		pdev->no_vf_scan = 1;
 
 	pdev->dev.groups = zpci_attr_groups;
-	pdev->dev.dma_ops = &s390_pci_dma_ops;
 	zpci_map_resources(pdev);
 
 	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
@@ -789,8 +792,6 @@ int zpci_hot_reset_device(struct zpci_dev *zdev)
 	if (zdev->dma_table)
 		rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
 					virt_to_phys(zdev->dma_table), &status);
-	else
-		rc = zpci_dma_init_device(zdev);
 	if (rc) {
 		zpci_disable_device(zdev);
 		return rc;
@@ -915,11 +916,6 @@ int zpci_deconfigure_device(struct zpci_dev *zdev)
 	if (zdev->zbus->bus)
 		zpci_bus_remove_device(zdev, false);
 
-	if (zdev->dma_table) {
-		rc = zpci_dma_exit_device(zdev);
-		if (rc)
-			return rc;
-	}
 	if (zdev_enabled(zdev)) {
 		rc = zpci_disable_device(zdev);
 		if (rc)
@@ -968,8 +964,6 @@ void zpci_release_device(struct kref *kref)
 	if (zdev->zbus->bus)
 		zpci_bus_remove_device(zdev, false);
 
-	if (zdev->dma_table)
-		zpci_dma_exit_device(zdev);
 	if (zdev_enabled(zdev))
 		zpci_disable_device(zdev);
 
@@ -1159,10 +1153,6 @@ static int __init pci_base_init(void)
 	if (rc)
 		goto out_irq;
 
-	rc = zpci_dma_init();
-	if (rc)
-		goto out_dma;
-
 	rc = clp_scan_pci_devices();
 	if (rc)
 		goto out_find;
@@ -1172,8 +1162,6 @@ static int __init pci_base_init(void)
 	return 0;
 
 out_find:
-	zpci_dma_exit();
-out_dma:
 	zpci_irq_exit();
 out_irq:
 	zpci_mem_exit();
diff --git a/arch/s390/pci/pci_bus.c b/arch/s390/pci/pci_bus.c
index 6a8da1b742ae..b15ad15999f8 100644
--- a/arch/s390/pci/pci_bus.c
+++ b/arch/s390/pci/pci_bus.c
@@ -49,11 +49,6 @@ static int zpci_bus_prepare_device(struct zpci_dev *zdev)
 		rc = zpci_enable_device(zdev);
 		if (rc)
 			return rc;
-		rc = zpci_dma_init_device(zdev);
-		if (rc) {
-			zpci_disable_device(zdev);
-			return rc;
-		}
 	}
 
 	if (!zdev->has_resources) {
diff --git a/arch/s390/pci/pci_debug.c b/arch/s390/pci/pci_debug.c
index ca6bd98eec13..60cec57a3907 100644
--- a/arch/s390/pci/pci_debug.c
+++ b/arch/s390/pci/pci_debug.c
@@ -53,9 +53,12 @@ static char *pci_fmt3_names[] = {
 };
 
 static char *pci_sw_names[] = {
-	"Allocated pages",
+/* TODO "Allocated pages", */
 	"Mapped pages",
 	"Unmapped pages",
+	"Global RPCITs",
+	"Sync Map RPCITs",
+	"Sync RPCITs",
 };
 
 static void pci_fmb_show(struct seq_file *m, char *name[], int length,
@@ -69,10 +72,14 @@ static void pci_fmb_show(struct seq_file *m, char *name[], int length,
 
 static void pci_sw_counter_show(struct seq_file *m)
 {
-	struct zpci_dev *zdev = m->private;
-	atomic64_t *counter = &zdev->allocated_pages;
+	struct zpci_iommu_ctrs  *ctrs = zpci_get_iommu_ctrs(m->private);
+	atomic64_t *counter;
 	int i;
 
+	if (!ctrs)
+		return;
+
+	counter = &ctrs->mapped_pages;
 	for (i = 0; i < ARRAY_SIZE(pci_sw_names); i++, counter++)
 		seq_printf(m, "%26s:\t%llu\n", pci_sw_names[i],
 			   atomic64_read(counter));
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
deleted file mode 100644
index ea478d11fbd1..000000000000
--- a/arch/s390/pci/pci_dma.c
+++ /dev/null
@@ -1,732 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright IBM Corp. 2012
- *
- * Author(s):
- *   Jan Glauber <jang@linux.vnet.ibm.com>
- */
-
-#include <linux/kernel.h>
-#include <linux/slab.h>
-#include <linux/export.h>
-#include <linux/iommu-helper.h>
-#include <linux/dma-map-ops.h>
-#include <linux/vmalloc.h>
-#include <linux/pci.h>
-#include <asm/pci_dma.h>
-
-static struct kmem_cache *dma_region_table_cache;
-static struct kmem_cache *dma_page_table_cache;
-static int s390_iommu_strict;
-static u64 s390_iommu_aperture;
-static u32 s390_iommu_aperture_factor = 1;
-
-static int zpci_refresh_global(struct zpci_dev *zdev)
-{
-	return zpci_refresh_trans((u64) zdev->fh << 32, zdev->start_dma,
-				  zdev->iommu_pages * PAGE_SIZE);
-}
-
-unsigned long *dma_alloc_cpu_table(void)
-{
-	unsigned long *table, *entry;
-
-	table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
-	if (!table)
-		return NULL;
-
-	for (entry = table; entry < table + ZPCI_TABLE_ENTRIES; entry++)
-		*entry = ZPCI_TABLE_INVALID;
-	return table;
-}
-
-static void dma_free_cpu_table(void *table)
-{
-	kmem_cache_free(dma_region_table_cache, table);
-}
-
-static unsigned long *dma_alloc_page_table(void)
-{
-	unsigned long *table, *entry;
-
-	table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
-	if (!table)
-		return NULL;
-
-	for (entry = table; entry < table + ZPCI_PT_ENTRIES; entry++)
-		*entry = ZPCI_PTE_INVALID;
-	return table;
-}
-
-static void dma_free_page_table(void *table)
-{
-	kmem_cache_free(dma_page_table_cache, table);
-}
-
-static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
-{
-	unsigned long old_rte, rte;
-	unsigned long *sto;
-
-	rte = READ_ONCE(*rtep);
-	if (reg_entry_isvalid(rte)) {
-		sto = get_rt_sto(rte);
-	} else {
-		sto = dma_alloc_cpu_table();
-		if (!sto)
-			return NULL;
-
-		set_rt_sto(&rte, virt_to_phys(sto));
-		validate_rt_entry(&rte);
-		entry_clr_protected(&rte);
-
-		old_rte = cmpxchg(rtep, ZPCI_TABLE_INVALID, rte);
-		if (old_rte != ZPCI_TABLE_INVALID) {
-			/* Somone else was faster, use theirs */
-			dma_free_cpu_table(sto);
-			sto = get_rt_sto(old_rte);
-		}
-	}
-	return sto;
-}
-
-static unsigned long *dma_get_page_table_origin(unsigned long *step)
-{
-	unsigned long old_ste, ste;
-	unsigned long *pto;
-
-	ste = READ_ONCE(*step);
-	if (reg_entry_isvalid(ste)) {
-		pto = get_st_pto(ste);
-	} else {
-		pto = dma_alloc_page_table();
-		if (!pto)
-			return NULL;
-		set_st_pto(&ste, virt_to_phys(pto));
-		validate_st_entry(&ste);
-		entry_clr_protected(&ste);
-
-		old_ste = cmpxchg(step, ZPCI_TABLE_INVALID, ste);
-		if (old_ste != ZPCI_TABLE_INVALID) {
-			/* Somone else was faster, use theirs */
-			dma_free_page_table(pto);
-			pto = get_st_pto(old_ste);
-		}
-	}
-	return pto;
-}
-
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
-{
-	unsigned long *sto, *pto;
-	unsigned int rtx, sx, px;
-
-	rtx = calc_rtx(dma_addr);
-	sto = dma_get_seg_table_origin(&rto[rtx]);
-	if (!sto)
-		return NULL;
-
-	sx = calc_sx(dma_addr);
-	pto = dma_get_page_table_origin(&sto[sx]);
-	if (!pto)
-		return NULL;
-
-	px = calc_px(dma_addr);
-	return &pto[px];
-}
-
-void dma_update_cpu_trans(unsigned long *ptep, phys_addr_t page_addr, int flags)
-{
-	unsigned long pte;
-
-	pte = READ_ONCE(*ptep);
-	if (flags & ZPCI_PTE_INVALID) {
-		invalidate_pt_entry(&pte);
-	} else {
-		set_pt_pfaa(&pte, page_addr);
-		validate_pt_entry(&pte);
-	}
-
-	if (flags & ZPCI_TABLE_PROTECTED)
-		entry_set_protected(&pte);
-	else
-		entry_clr_protected(&pte);
-
-	xchg(ptep, pte);
-}
-
-static int __dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
-			      dma_addr_t dma_addr, size_t size, int flags)
-{
-	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	phys_addr_t page_addr = (pa & PAGE_MASK);
-	unsigned long *entry;
-	int i, rc = 0;
-
-	if (!nr_pages)
-		return -EINVAL;
-
-	if (!zdev->dma_table)
-		return -EINVAL;
-
-	for (i = 0; i < nr_pages; i++) {
-		entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
-		if (!entry) {
-			rc = -ENOMEM;
-			goto undo_cpu_trans;
-		}
-		dma_update_cpu_trans(entry, page_addr, flags);
-		page_addr += PAGE_SIZE;
-		dma_addr += PAGE_SIZE;
-	}
-
-undo_cpu_trans:
-	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID)) {
-		flags = ZPCI_PTE_INVALID;
-		while (i-- > 0) {
-			page_addr -= PAGE_SIZE;
-			dma_addr -= PAGE_SIZE;
-			entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
-			if (!entry)
-				break;
-			dma_update_cpu_trans(entry, page_addr, flags);
-		}
-	}
-	return rc;
-}
-
-static int __dma_purge_tlb(struct zpci_dev *zdev, dma_addr_t dma_addr,
-			   size_t size, int flags)
-{
-	unsigned long irqflags;
-	int ret;
-
-	/*
-	 * With zdev->tlb_refresh == 0, rpcit is not required to establish new
-	 * translations when previously invalid translation-table entries are
-	 * validated. With lazy unmap, rpcit is skipped for previously valid
-	 * entries, but a global rpcit is then required before any address can
-	 * be re-used, i.e. after each iommu bitmap wrap-around.
-	 */
-	if ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID) {
-		if (!zdev->tlb_refresh)
-			return 0;
-	} else {
-		if (!s390_iommu_strict)
-			return 0;
-	}
-
-	ret = zpci_refresh_trans((u64) zdev->fh << 32, dma_addr,
-				 PAGE_ALIGN(size));
-	if (ret == -ENOMEM && !s390_iommu_strict) {
-		/* enable the hypervisor to free some resources */
-		if (zpci_refresh_global(zdev))
-			goto out;
-
-		spin_lock_irqsave(&zdev->iommu_bitmap_lock, irqflags);
-		bitmap_andnot(zdev->iommu_bitmap, zdev->iommu_bitmap,
-			      zdev->lazy_bitmap, zdev->iommu_pages);
-		bitmap_zero(zdev->lazy_bitmap, zdev->iommu_pages);
-		spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, irqflags);
-		ret = 0;
-	}
-out:
-	return ret;
-}
-
-static int dma_update_trans(struct zpci_dev *zdev, phys_addr_t pa,
-			    dma_addr_t dma_addr, size_t size, int flags)
-{
-	int rc;
-
-	rc = __dma_update_trans(zdev, pa, dma_addr, size, flags);
-	if (rc)
-		return rc;
-
-	rc = __dma_purge_tlb(zdev, dma_addr, size, flags);
-	if (rc && ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID))
-		__dma_update_trans(zdev, pa, dma_addr, size, ZPCI_PTE_INVALID);
-
-	return rc;
-}
-
-void dma_free_seg_table(unsigned long entry)
-{
-	unsigned long *sto = get_rt_sto(entry);
-	int sx;
-
-	for (sx = 0; sx < ZPCI_TABLE_ENTRIES; sx++)
-		if (reg_entry_isvalid(sto[sx]))
-			dma_free_page_table(get_st_pto(sto[sx]));
-
-	dma_free_cpu_table(sto);
-}
-
-void dma_cleanup_tables(unsigned long *table)
-{
-	int rtx;
-
-	if (!table)
-		return;
-
-	for (rtx = 0; rtx < ZPCI_TABLE_ENTRIES; rtx++)
-		if (reg_entry_isvalid(table[rtx]))
-			dma_free_seg_table(table[rtx]);
-
-	dma_free_cpu_table(table);
-}
-
-static unsigned long __dma_alloc_iommu(struct device *dev,
-				       unsigned long start, int size)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-
-	return iommu_area_alloc(zdev->iommu_bitmap, zdev->iommu_pages,
-				start, size, zdev->start_dma >> PAGE_SHIFT,
-				dma_get_seg_boundary_nr_pages(dev, PAGE_SHIFT),
-				0);
-}
-
-static dma_addr_t dma_alloc_address(struct device *dev, int size)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	unsigned long offset, flags;
-
-	spin_lock_irqsave(&zdev->iommu_bitmap_lock, flags);
-	offset = __dma_alloc_iommu(dev, zdev->next_bit, size);
-	if (offset == -1) {
-		if (!s390_iommu_strict) {
-			/* global flush before DMA addresses are reused */
-			if (zpci_refresh_global(zdev))
-				goto out_error;
-
-			bitmap_andnot(zdev->iommu_bitmap, zdev->iommu_bitmap,
-				      zdev->lazy_bitmap, zdev->iommu_pages);
-			bitmap_zero(zdev->lazy_bitmap, zdev->iommu_pages);
-		}
-		/* wrap-around */
-		offset = __dma_alloc_iommu(dev, 0, size);
-		if (offset == -1)
-			goto out_error;
-	}
-	zdev->next_bit = offset + size;
-	spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, flags);
-
-	return zdev->start_dma + offset * PAGE_SIZE;
-
-out_error:
-	spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, flags);
-	return DMA_MAPPING_ERROR;
-}
-
-static void dma_free_address(struct device *dev, dma_addr_t dma_addr, int size)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	unsigned long flags, offset;
-
-	offset = (dma_addr - zdev->start_dma) >> PAGE_SHIFT;
-
-	spin_lock_irqsave(&zdev->iommu_bitmap_lock, flags);
-	if (!zdev->iommu_bitmap)
-		goto out;
-
-	if (s390_iommu_strict)
-		bitmap_clear(zdev->iommu_bitmap, offset, size);
-	else
-		bitmap_set(zdev->lazy_bitmap, offset, size);
-
-out:
-	spin_unlock_irqrestore(&zdev->iommu_bitmap_lock, flags);
-}
-
-static inline void zpci_err_dma(unsigned long rc, unsigned long addr)
-{
-	struct {
-		unsigned long rc;
-		unsigned long addr;
-	} __packed data = {rc, addr};
-
-	zpci_err_hex(&data, sizeof(data));
-}
-
-static dma_addr_t s390_dma_map_pages(struct device *dev, struct page *page,
-				     unsigned long offset, size_t size,
-				     enum dma_data_direction direction,
-				     unsigned long attrs)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	unsigned long pa = page_to_phys(page) + offset;
-	int flags = ZPCI_PTE_VALID;
-	unsigned long nr_pages;
-	dma_addr_t dma_addr;
-	int ret;
-
-	/* This rounds up number of pages based on size and offset */
-	nr_pages = iommu_num_pages(pa, size, PAGE_SIZE);
-	dma_addr = dma_alloc_address(dev, nr_pages);
-	if (dma_addr == DMA_MAPPING_ERROR) {
-		ret = -ENOSPC;
-		goto out_err;
-	}
-
-	/* Use rounded up size */
-	size = nr_pages * PAGE_SIZE;
-
-	if (direction == DMA_NONE || direction == DMA_TO_DEVICE)
-		flags |= ZPCI_TABLE_PROTECTED;
-
-	ret = dma_update_trans(zdev, pa, dma_addr, size, flags);
-	if (ret)
-		goto out_free;
-
-	atomic64_add(nr_pages, &zdev->mapped_pages);
-	return dma_addr + (offset & ~PAGE_MASK);
-
-out_free:
-	dma_free_address(dev, dma_addr, nr_pages);
-out_err:
-	zpci_err("map error:\n");
-	zpci_err_dma(ret, pa);
-	return DMA_MAPPING_ERROR;
-}
-
-static void s390_dma_unmap_pages(struct device *dev, dma_addr_t dma_addr,
-				 size_t size, enum dma_data_direction direction,
-				 unsigned long attrs)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	int npages, ret;
-
-	npages = iommu_num_pages(dma_addr, size, PAGE_SIZE);
-	dma_addr = dma_addr & PAGE_MASK;
-	ret = dma_update_trans(zdev, 0, dma_addr, npages * PAGE_SIZE,
-			       ZPCI_PTE_INVALID);
-	if (ret) {
-		zpci_err("unmap error:\n");
-		zpci_err_dma(ret, dma_addr);
-		return;
-	}
-
-	atomic64_add(npages, &zdev->unmapped_pages);
-	dma_free_address(dev, dma_addr, npages);
-}
-
-static void *s390_dma_alloc(struct device *dev, size_t size,
-			    dma_addr_t *dma_handle, gfp_t flag,
-			    unsigned long attrs)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	struct page *page;
-	phys_addr_t pa;
-	dma_addr_t map;
-
-	size = PAGE_ALIGN(size);
-	page = alloc_pages(flag | __GFP_ZERO, get_order(size));
-	if (!page)
-		return NULL;
-
-	pa = page_to_phys(page);
-	map = s390_dma_map_pages(dev, page, 0, size, DMA_BIDIRECTIONAL, 0);
-	if (dma_mapping_error(dev, map)) {
-		__free_pages(page, get_order(size));
-		return NULL;
-	}
-
-	atomic64_add(size / PAGE_SIZE, &zdev->allocated_pages);
-	if (dma_handle)
-		*dma_handle = map;
-	return phys_to_virt(pa);
-}
-
-static void s390_dma_free(struct device *dev, size_t size,
-			  void *vaddr, dma_addr_t dma_handle,
-			  unsigned long attrs)
-{
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-
-	size = PAGE_ALIGN(size);
-	atomic64_sub(size / PAGE_SIZE, &zdev->allocated_pages);
-	s390_dma_unmap_pages(dev, dma_handle, size, DMA_BIDIRECTIONAL, 0);
-	free_pages((unsigned long)vaddr, get_order(size));
-}
-
-/* Map a segment into a contiguous dma address area */
-static int __s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
-			     size_t size, dma_addr_t *handle,
-			     enum dma_data_direction dir)
-{
-	unsigned long nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
-	dma_addr_t dma_addr_base, dma_addr;
-	int flags = ZPCI_PTE_VALID;
-	struct scatterlist *s;
-	phys_addr_t pa = 0;
-	int ret;
-
-	dma_addr_base = dma_alloc_address(dev, nr_pages);
-	if (dma_addr_base == DMA_MAPPING_ERROR)
-		return -ENOMEM;
-
-	dma_addr = dma_addr_base;
-	if (dir == DMA_NONE || dir == DMA_TO_DEVICE)
-		flags |= ZPCI_TABLE_PROTECTED;
-
-	for (s = sg; dma_addr < dma_addr_base + size; s = sg_next(s)) {
-		pa = page_to_phys(sg_page(s));
-		ret = __dma_update_trans(zdev, pa, dma_addr,
-					 s->offset + s->length, flags);
-		if (ret)
-			goto unmap;
-
-		dma_addr += s->offset + s->length;
-	}
-	ret = __dma_purge_tlb(zdev, dma_addr_base, size, flags);
-	if (ret)
-		goto unmap;
-
-	*handle = dma_addr_base;
-	atomic64_add(nr_pages, &zdev->mapped_pages);
-
-	return ret;
-
-unmap:
-	dma_update_trans(zdev, 0, dma_addr_base, dma_addr - dma_addr_base,
-			 ZPCI_PTE_INVALID);
-	dma_free_address(dev, dma_addr_base, nr_pages);
-	zpci_err("map error:\n");
-	zpci_err_dma(ret, pa);
-	return ret;
-}
-
-static int s390_dma_map_sg(struct device *dev, struct scatterlist *sg,
-			   int nr_elements, enum dma_data_direction dir,
-			   unsigned long attrs)
-{
-	struct scatterlist *s = sg, *start = sg, *dma = sg;
-	unsigned int max = dma_get_max_seg_size(dev);
-	unsigned int size = s->offset + s->length;
-	unsigned int offset = s->offset;
-	int count = 0, i, ret;
-
-	for (i = 1; i < nr_elements; i++) {
-		s = sg_next(s);
-
-		s->dma_length = 0;
-
-		if (s->offset || (size & ~PAGE_MASK) ||
-		    size + s->length > max) {
-			ret = __s390_dma_map_sg(dev, start, size,
-						&dma->dma_address, dir);
-			if (ret)
-				goto unmap;
-
-			dma->dma_address += offset;
-			dma->dma_length = size - offset;
-
-			size = offset = s->offset;
-			start = s;
-			dma = sg_next(dma);
-			count++;
-		}
-		size += s->length;
-	}
-	ret = __s390_dma_map_sg(dev, start, size, &dma->dma_address, dir);
-	if (ret)
-		goto unmap;
-
-	dma->dma_address += offset;
-	dma->dma_length = size - offset;
-
-	return count + 1;
-unmap:
-	for_each_sg(sg, s, count, i)
-		s390_dma_unmap_pages(dev, sg_dma_address(s), sg_dma_len(s),
-				     dir, attrs);
-
-	return ret;
-}
-
-static void s390_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
-			      int nr_elements, enum dma_data_direction dir,
-			      unsigned long attrs)
-{
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nr_elements, i) {
-		if (s->dma_length)
-			s390_dma_unmap_pages(dev, s->dma_address, s->dma_length,
-					     dir, attrs);
-		s->dma_address = 0;
-		s->dma_length = 0;
-	}
-}
-	
-int zpci_dma_init_device(struct zpci_dev *zdev)
-{
-	u8 status;
-	int rc;
-
-	/*
-	 * At this point, if the device is part of an IOMMU domain, this would
-	 * be a strong hint towards a bug in the IOMMU API (common) code and/or
-	 * simultaneous access via IOMMU and DMA API. So let's issue a warning.
-	 */
-	WARN_ON(zdev->s390_domain);
-
-	spin_lock_init(&zdev->iommu_bitmap_lock);
-
-	zdev->dma_table = dma_alloc_cpu_table();
-	if (!zdev->dma_table) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	/*
-	 * Restrict the iommu bitmap size to the minimum of the following:
-	 * - s390_iommu_aperture which defaults to high_memory
-	 * - 3-level pagetable address limit minus start_dma offset
-	 * - DMA address range allowed by the hardware (clp query pci fn)
-	 *
-	 * Also set zdev->end_dma to the actual end address of the usable
-	 * range, instead of the theoretical maximum as reported by hardware.
-	 *
-	 * This limits the number of concurrently usable DMA mappings since
-	 * for each DMA mapped memory address we need a DMA address including
-	 * extra DMA addresses for multiple mappings of the same memory address.
-	 */
-	zdev->start_dma = PAGE_ALIGN(zdev->start_dma);
-	zdev->iommu_size = min3(s390_iommu_aperture,
-				ZPCI_TABLE_SIZE_RT - zdev->start_dma,
-				zdev->end_dma - zdev->start_dma + 1);
-	zdev->end_dma = zdev->start_dma + zdev->iommu_size - 1;
-	zdev->iommu_pages = zdev->iommu_size >> PAGE_SHIFT;
-	zdev->iommu_bitmap = vzalloc(zdev->iommu_pages / 8);
-	if (!zdev->iommu_bitmap) {
-		rc = -ENOMEM;
-		goto free_dma_table;
-	}
-	if (!s390_iommu_strict) {
-		zdev->lazy_bitmap = vzalloc(zdev->iommu_pages / 8);
-		if (!zdev->lazy_bitmap) {
-			rc = -ENOMEM;
-			goto free_bitmap;
-		}
-
-	}
-	if (zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
-			       virt_to_phys(zdev->dma_table), &status)) {
-		rc = -EIO;
-		goto free_bitmap;
-	}
-
-	return 0;
-free_bitmap:
-	vfree(zdev->iommu_bitmap);
-	zdev->iommu_bitmap = NULL;
-	vfree(zdev->lazy_bitmap);
-	zdev->lazy_bitmap = NULL;
-free_dma_table:
-	dma_free_cpu_table(zdev->dma_table);
-	zdev->dma_table = NULL;
-out:
-	return rc;
-}
-
-int zpci_dma_exit_device(struct zpci_dev *zdev)
-{
-	int cc = 0;
-
-	/*
-	 * At this point, if the device is part of an IOMMU domain, this would
-	 * be a strong hint towards a bug in the IOMMU API (common) code and/or
-	 * simultaneous access via IOMMU and DMA API. So let's issue a warning.
-	 */
-	WARN_ON(zdev->s390_domain);
-	if (zdev_enabled(zdev))
-		cc = zpci_unregister_ioat(zdev, 0);
-	/*
-	 * cc == 3 indicates the function is gone already. This can happen
-	 * if the function was deconfigured/disabled suddenly and we have not
-	 * received a new handle yet.
-	 */
-	if (cc && cc != 3)
-		return -EIO;
-
-	dma_cleanup_tables(zdev->dma_table);
-	zdev->dma_table = NULL;
-	vfree(zdev->iommu_bitmap);
-	zdev->iommu_bitmap = NULL;
-	vfree(zdev->lazy_bitmap);
-	zdev->lazy_bitmap = NULL;
-	zdev->next_bit = 0;
-	return 0;
-}
-
-static int __init dma_alloc_cpu_table_caches(void)
-{
-	dma_region_table_cache = kmem_cache_create("PCI_DMA_region_tables",
-					ZPCI_TABLE_SIZE, ZPCI_TABLE_ALIGN,
-					0, NULL);
-	if (!dma_region_table_cache)
-		return -ENOMEM;
-
-	dma_page_table_cache = kmem_cache_create("PCI_DMA_page_tables",
-					ZPCI_PT_SIZE, ZPCI_PT_ALIGN,
-					0, NULL);
-	if (!dma_page_table_cache) {
-		kmem_cache_destroy(dma_region_table_cache);
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-int __init zpci_dma_init(void)
-{
-	s390_iommu_aperture = (u64)virt_to_phys(high_memory);
-	if (!s390_iommu_aperture_factor)
-		s390_iommu_aperture = ULONG_MAX;
-	else
-		s390_iommu_aperture *= s390_iommu_aperture_factor;
-
-	return dma_alloc_cpu_table_caches();
-}
-
-void zpci_dma_exit(void)
-{
-	kmem_cache_destroy(dma_page_table_cache);
-	kmem_cache_destroy(dma_region_table_cache);
-}
-
-const struct dma_map_ops s390_pci_dma_ops = {
-	.alloc		= s390_dma_alloc,
-	.free		= s390_dma_free,
-	.map_sg		= s390_dma_map_sg,
-	.unmap_sg	= s390_dma_unmap_sg,
-	.map_page	= s390_dma_map_pages,
-	.unmap_page	= s390_dma_unmap_pages,
-	.mmap		= dma_common_mmap,
-	.get_sgtable	= dma_common_get_sgtable,
-	.alloc_pages	= dma_common_alloc_pages,
-	.free_pages	= dma_common_free_pages,
-	/* dma_supported is unconditionally true without a callback */
-};
-EXPORT_SYMBOL_GPL(s390_pci_dma_ops);
-
-static int __init s390_iommu_setup(char *str)
-{
-	if (!strcmp(str, "strict"))
-		s390_iommu_strict = 1;
-	return 1;
-}
-
-__setup("s390_iommu=", s390_iommu_setup);
-
-static int __init s390_iommu_aperture_setup(char *str)
-{
-	if (kstrtou32(str, 10, &s390_iommu_aperture_factor))
-		s390_iommu_aperture_factor = 1;
-	return 1;
-}
-
-__setup("s390_iommu_aperture=", s390_iommu_aperture_setup);
diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
index 4ef5a6a1d618..4d9773ef9e0a 100644
--- a/arch/s390/pci/pci_event.c
+++ b/arch/s390/pci/pci_event.c
@@ -313,8 +313,6 @@ static void zpci_event_hard_deconfigured(struct zpci_dev *zdev, u32 fh)
 	/* Even though the device is already gone we still
 	 * need to free zPCI resources as part of the disable.
 	 */
-	if (zdev->dma_table)
-		zpci_dma_exit_device(zdev);
 	if (zdev_enabled(zdev))
 		zpci_disable_device(zdev);
 	zdev->state = ZPCI_FN_STATE_STANDBY;
diff --git a/arch/s390/pci/pci_sysfs.c b/arch/s390/pci/pci_sysfs.c
index cae280e5c047..8a7abac51816 100644
--- a/arch/s390/pci/pci_sysfs.c
+++ b/arch/s390/pci/pci_sysfs.c
@@ -56,6 +56,7 @@ static ssize_t recover_store(struct device *dev, struct device_attribute *attr,
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct zpci_dev *zdev = to_zpci(pdev);
 	int ret = 0;
+	u8 status;
 
 	/* Can't use device_remove_self() here as that would lead us to lock
 	 * the pci_rescan_remove_lock while holding the device' kernfs lock.
@@ -82,12 +83,6 @@ static ssize_t recover_store(struct device *dev, struct device_attribute *attr,
 	pci_lock_rescan_remove();
 	if (pci_dev_is_added(pdev)) {
 		pci_stop_and_remove_bus_device(pdev);
-		if (zdev->dma_table) {
-			ret = zpci_dma_exit_device(zdev);
-			if (ret)
-				goto out;
-		}
-
 		if (zdev_enabled(zdev)) {
 			ret = zpci_disable_device(zdev);
 			/*
@@ -105,14 +100,16 @@ static ssize_t recover_store(struct device *dev, struct device_attribute *attr,
 		ret = zpci_enable_device(zdev);
 		if (ret)
 			goto out;
-		ret = zpci_dma_init_device(zdev);
-		if (ret) {
-			zpci_disable_device(zdev);
-			goto out;
+
+		if (zdev->dma_table) {
+			ret = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
+						 virt_to_phys(zdev->dma_table), &status);
+			if (ret)
+				zpci_disable_device(zdev);
 		}
-		pci_rescan_bus(zdev->zbus->bus);
 	}
 out:
+	pci_rescan_bus(zdev->zbus->bus);
 	pci_unlock_rescan_remove();
 	if (kn)
 		sysfs_unbreak_active_protection(kn);
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dc5f7a156ff5..804fb8f42d61 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -93,7 +93,7 @@ config IOMMU_DEBUGFS
 choice
 	prompt "IOMMU default domain type"
 	depends on IOMMU_API
-	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64
+	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64 || S390
 	default IOMMU_DEFAULT_DMA_STRICT
 	help
 	  Choose the type of IOMMU domain used to manage DMA API usage by
@@ -412,6 +412,7 @@ config ARM_SMMU_V3_SVA
 config S390_IOMMU
 	def_bool y if S390 && PCI
 	depends on S390 && PCI
+	select IOMMU_DMA
 	select IOMMU_API
 	help
 	  Support for the IOMMU API for s390 PCI devices.
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index ed33c6cce083..9fd788d64ac8 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -14,16 +14,300 @@
 #include <linux/rcupdate.h>
 #include <asm/pci_dma.h>
 
+#include "dma-iommu.h"
+
 static const struct iommu_ops s390_iommu_ops;
 
+static struct kmem_cache *dma_region_table_cache;
+static struct kmem_cache *dma_page_table_cache;
+
+static u64 s390_iommu_aperture;
+static u32 s390_iommu_aperture_factor = 1;
+
 struct s390_domain {
 	struct iommu_domain	domain;
 	struct list_head	devices;
+	struct zpci_iommu_ctrs	ctrs;
 	unsigned long		*dma_table;
 	spinlock_t		list_lock;
 	struct rcu_head		rcu;
 };
 
+static inline unsigned int calc_rtx(dma_addr_t ptr)
+{
+	return ((unsigned long)ptr >> ZPCI_RT_SHIFT) & ZPCI_INDEX_MASK;
+}
+
+static inline unsigned int calc_sx(dma_addr_t ptr)
+{
+	return ((unsigned long)ptr >> ZPCI_ST_SHIFT) & ZPCI_INDEX_MASK;
+}
+
+static inline unsigned int calc_px(dma_addr_t ptr)
+{
+	return ((unsigned long)ptr >> PAGE_SHIFT) & ZPCI_PT_MASK;
+}
+
+static inline void set_pt_pfaa(unsigned long *entry, phys_addr_t pfaa)
+{
+	*entry &= ZPCI_PTE_FLAG_MASK;
+	*entry |= (pfaa & ZPCI_PTE_ADDR_MASK);
+}
+
+static inline void set_rt_sto(unsigned long *entry, phys_addr_t sto)
+{
+	*entry &= ZPCI_RTE_FLAG_MASK;
+	*entry |= (sto & ZPCI_RTE_ADDR_MASK);
+	*entry |= ZPCI_TABLE_TYPE_RTX;
+}
+
+static inline void set_st_pto(unsigned long *entry, phys_addr_t pto)
+{
+	*entry &= ZPCI_STE_FLAG_MASK;
+	*entry |= (pto & ZPCI_STE_ADDR_MASK);
+	*entry |= ZPCI_TABLE_TYPE_SX;
+}
+
+static inline void validate_rt_entry(unsigned long *entry)
+{
+	*entry &= ~ZPCI_TABLE_VALID_MASK;
+	*entry &= ~ZPCI_TABLE_OFFSET_MASK;
+	*entry |= ZPCI_TABLE_VALID;
+	*entry |= ZPCI_TABLE_LEN_RTX;
+}
+
+static inline void validate_st_entry(unsigned long *entry)
+{
+	*entry &= ~ZPCI_TABLE_VALID_MASK;
+	*entry |= ZPCI_TABLE_VALID;
+}
+
+static inline void invalidate_pt_entry(unsigned long *entry)
+{
+	WARN_ON_ONCE((*entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_INVALID);
+	*entry &= ~ZPCI_PTE_VALID_MASK;
+	*entry |= ZPCI_PTE_INVALID;
+}
+
+static inline void validate_pt_entry(unsigned long *entry)
+{
+	WARN_ON_ONCE((*entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID);
+	*entry &= ~ZPCI_PTE_VALID_MASK;
+	*entry |= ZPCI_PTE_VALID;
+}
+
+static inline void entry_set_protected(unsigned long *entry)
+{
+	*entry &= ~ZPCI_TABLE_PROT_MASK;
+	*entry |= ZPCI_TABLE_PROTECTED;
+}
+
+static inline void entry_clr_protected(unsigned long *entry)
+{
+	*entry &= ~ZPCI_TABLE_PROT_MASK;
+	*entry |= ZPCI_TABLE_UNPROTECTED;
+}
+
+static inline int reg_entry_isvalid(unsigned long entry)
+{
+	return (entry & ZPCI_TABLE_VALID_MASK) == ZPCI_TABLE_VALID;
+}
+
+static inline int pt_entry_isvalid(unsigned long entry)
+{
+	return (entry & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID;
+}
+
+static inline unsigned long *get_rt_sto(unsigned long entry)
+{
+	if ((entry & ZPCI_TABLE_TYPE_MASK) == ZPCI_TABLE_TYPE_RTX)
+		return phys_to_virt(entry & ZPCI_RTE_ADDR_MASK);
+	else
+		return NULL;
+}
+
+static inline unsigned long *get_st_pto(unsigned long entry)
+{
+	if ((entry & ZPCI_TABLE_TYPE_MASK) == ZPCI_TABLE_TYPE_SX)
+		return phys_to_virt(entry & ZPCI_STE_ADDR_MASK);
+	else
+		return NULL;
+}
+
+static int __init dma_alloc_cpu_table_caches(void)
+{
+	dma_region_table_cache = kmem_cache_create("PCI_DMA_region_tables",
+						   ZPCI_TABLE_SIZE,
+						   ZPCI_TABLE_ALIGN,
+						   0, NULL);
+	if (!dma_region_table_cache)
+		return -ENOMEM;
+
+	dma_page_table_cache = kmem_cache_create("PCI_DMA_page_tables",
+						 ZPCI_PT_SIZE,
+						 ZPCI_PT_ALIGN,
+						 0, NULL);
+	if (!dma_page_table_cache) {
+		kmem_cache_destroy(dma_region_table_cache);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static unsigned long *dma_alloc_cpu_table(void)
+{
+	unsigned long *table, *entry;
+
+	table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
+	if (!table)
+		return NULL;
+
+	for (entry = table; entry < table + ZPCI_TABLE_ENTRIES; entry++)
+		*entry = ZPCI_TABLE_INVALID;
+	return table;
+}
+
+static void dma_free_cpu_table(void *table)
+{
+	kmem_cache_free(dma_region_table_cache, table);
+}
+
+static void dma_free_page_table(void *table)
+{
+	kmem_cache_free(dma_page_table_cache, table);
+}
+
+static void dma_free_seg_table(unsigned long entry)
+{
+	unsigned long *sto = get_rt_sto(entry);
+	int sx;
+
+	for (sx = 0; sx < ZPCI_TABLE_ENTRIES; sx++)
+		if (reg_entry_isvalid(sto[sx]))
+			dma_free_page_table(get_st_pto(sto[sx]));
+
+	dma_free_cpu_table(sto);
+}
+
+static void dma_cleanup_tables(unsigned long *table)
+{
+	int rtx;
+
+	if (!table)
+		return;
+
+	for (rtx = 0; rtx < ZPCI_TABLE_ENTRIES; rtx++)
+		if (reg_entry_isvalid(table[rtx]))
+			dma_free_seg_table(table[rtx]);
+
+	dma_free_cpu_table(table);
+}
+
+static unsigned long *dma_alloc_page_table(void)
+{
+	unsigned long *table, *entry;
+
+	table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
+	if (!table)
+		return NULL;
+
+	for (entry = table; entry < table + ZPCI_PT_ENTRIES; entry++)
+		*entry = ZPCI_PTE_INVALID;
+	return table;
+}
+
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
+{
+	unsigned long old_rte, rte;
+	unsigned long *sto;
+
+	rte = READ_ONCE(*rtep);
+	if (reg_entry_isvalid(rte)) {
+		sto = get_rt_sto(rte);
+	} else {
+		sto = dma_alloc_cpu_table();
+		if (!sto)
+			return NULL;
+
+		set_rt_sto(&rte, virt_to_phys(sto));
+		validate_rt_entry(&rte);
+		entry_clr_protected(&rte);
+
+		old_rte = cmpxchg(rtep, ZPCI_TABLE_INVALID, rte);
+		if (old_rte != ZPCI_TABLE_INVALID) {
+			/* Someone else was faster, use theirs */
+			dma_free_cpu_table(sto);
+			sto = get_rt_sto(old_rte);
+		}
+	}
+	return sto;
+}
+
+static unsigned long *dma_get_page_table_origin(unsigned long *step)
+{
+	unsigned long old_ste, ste;
+	unsigned long *pto;
+
+	ste = READ_ONCE(*step);
+	if (reg_entry_isvalid(ste)) {
+		pto = get_st_pto(ste);
+	} else {
+		pto = dma_alloc_page_table();
+		if (!pto)
+			return NULL;
+		set_st_pto(&ste, virt_to_phys(pto));
+		validate_st_entry(&ste);
+		entry_clr_protected(&ste);
+
+		old_ste = cmpxchg(step, ZPCI_TABLE_INVALID, ste);
+		if (old_ste != ZPCI_TABLE_INVALID) {
+			/* Someone else was faster, use theirs */
+			dma_free_page_table(pto);
+			pto = get_st_pto(old_ste);
+		}
+	}
+	return pto;
+}
+
+static unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
+{
+	unsigned long *sto, *pto;
+	unsigned int rtx, sx, px;
+
+	rtx = calc_rtx(dma_addr);
+	sto = dma_get_seg_table_origin(&rto[rtx]);
+	if (!sto)
+		return NULL;
+
+	sx = calc_sx(dma_addr);
+	pto = dma_get_page_table_origin(&sto[sx]);
+	if (!pto)
+		return NULL;
+
+	px = calc_px(dma_addr);
+	return &pto[px];
+}
+
+static void dma_update_cpu_trans(unsigned long *ptep, phys_addr_t page_addr, int flags)
+{
+	unsigned long pte;
+
+	pte = READ_ONCE(*ptep);
+	if (flags & ZPCI_PTE_INVALID) {
+		invalidate_pt_entry(&pte);
+	} else {
+		set_pt_pfaa(&pte, page_addr);
+		validate_pt_entry(&pte);
+	}
+
+	if (flags & ZPCI_TABLE_PROTECTED)
+		entry_set_protected(&pte);
+	else
+		entry_clr_protected(&pte);
+
+	xchg(ptep, pte);
+}
+
 static struct s390_domain *to_s390_domain(struct iommu_domain *dom)
 {
 	return container_of(dom, struct s390_domain, domain);
@@ -45,9 +329,14 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 {
 	struct s390_domain *s390_domain;
 
-	if (domain_type != IOMMU_DOMAIN_UNMANAGED)
+	switch (domain_type) {
+	case IOMMU_DOMAIN_DMA:
+	case IOMMU_DOMAIN_DMA_FQ:
+	case IOMMU_DOMAIN_UNMANAGED:
+		break;
+	default:
 		return NULL;
-
+	}
 	s390_domain = kzalloc(sizeof(*s390_domain), GFP_KERNEL);
 	if (!s390_domain)
 		return NULL;
@@ -86,11 +375,14 @@ static void s390_domain_free(struct iommu_domain *domain)
 	call_rcu(&s390_domain->rcu, s390_iommu_rcu_free_domain);
 }
 
-static void __s390_iommu_detach_device(struct zpci_dev *zdev)
+static void s390_iommu_detach_device(struct iommu_domain *domain,
+				     struct device *dev)
 {
-	struct s390_domain *s390_domain = zdev->s390_domain;
+	struct s390_domain *s390_domain = to_s390_domain(domain);
+	struct zpci_dev *zdev = to_zpci_dev(dev);
 	unsigned long flags;
 
+	WARN_ON(zdev->s390_domain != to_s390_domain(domain));
 	if (!s390_domain)
 		return;
 
@@ -120,9 +412,7 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 		return -EINVAL;
 
 	if (zdev->s390_domain)
-		__s390_iommu_detach_device(zdev);
-	else if (zdev->dma_table)
-		zpci_dma_exit_device(zdev);
+		s390_iommu_detach_device(&zdev->s390_domain->domain, dev);
 
 	cc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
 				virt_to_phys(s390_domain->dma_table), &status);
@@ -144,17 +434,6 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 	return 0;
 }
 
-static void s390_iommu_detach_device(struct iommu_domain *domain,
-				     struct device *dev)
-{
-	struct zpci_dev *zdev = to_zpci_dev(dev);
-
-	WARN_ON(zdev->s390_domain != to_s390_domain(domain));
-
-	__s390_iommu_detach_device(zdev);
-	zpci_dma_init_device(zdev);
-}
-
 static void s390_iommu_get_resv_regions(struct device *dev,
 					struct list_head *list)
 {
@@ -207,7 +486,7 @@ static void s390_iommu_release_device(struct device *dev)
 	 * to the device, but keep it attached to other devices in the group.
 	 */
 	if (zdev)
-		__s390_iommu_detach_device(zdev);
+		s390_iommu_detach_device(&zdev->s390_domain->domain, dev);
 }
 
 static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
@@ -217,6 +496,7 @@ static void s390_iommu_flush_iotlb_all(struct iommu_domain *domain)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
+		atomic64_inc(&s390_domain->ctrs.global_rpcits);
 		zpci_refresh_trans((u64)zdev->fh << 32, zdev->start_dma,
 				   zdev->end_dma - zdev->start_dma + 1);
 	}
@@ -236,6 +516,7 @@ static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
+		atomic64_inc(&s390_domain->ctrs.sync_rpcits);
 		zpci_refresh_trans((u64)zdev->fh << 32, gather->start,
 				   size);
 	}
@@ -252,6 +533,7 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		if (!zdev->tlb_refresh)
 			continue;
+		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
 		zpci_refresh_trans((u64)zdev->fh << 32,
 				   iova, size);
 	}
@@ -332,16 +614,15 @@ static int s390_iommu_map_pages(struct iommu_domain *domain,
 	if (!IS_ALIGNED(iova | paddr, pgsize))
 		return -EINVAL;
 
-	if (!(prot & IOMMU_READ))
-		return -EINVAL;
-
 	if (!(prot & IOMMU_WRITE))
 		flags |= ZPCI_TABLE_PROTECTED;
 
 	rc = s390_iommu_validate_trans(s390_domain, paddr, iova,
-				       pgcount, flags);
-	if (!rc)
+				     pgcount, flags);
+	if (!rc) {
 		*mapped = size;
+		atomic64_add(pgcount, &s390_domain->ctrs.mapped_pages);
+	}
 
 	return rc;
 }
@@ -397,12 +678,29 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
 		return 0;
 
 	iommu_iotlb_gather_add_range(gather, iova, size);
+	atomic64_add(pgcount, &s390_domain->ctrs.unmapped_pages);
 
 	return size;
 }
 
+static void s390_iommu_probe_finalize(struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+
+	iommu_dma_forcedac = true;
+	iommu_setup_dma_ops(dev, domain->geometry.aperture_start, domain->geometry.aperture_end);
+}
+
+struct zpci_iommu_ctrs *zpci_get_iommu_ctrs(struct zpci_dev *zdev)
+{
+	if (!zdev || !zdev->s390_domain)
+		return NULL;
+	return &zdev->s390_domain->ctrs;
+}
+
 int zpci_init_iommu(struct zpci_dev *zdev)
 {
+	u64 aperture_size;
 	int rc = 0;
 
 	rc = iommu_device_sysfs_add(&zdev->iommu_dev, NULL, NULL,
@@ -414,6 +712,12 @@ int zpci_init_iommu(struct zpci_dev *zdev)
 	if (rc)
 		goto out_sysfs;
 
+	zdev->start_dma = PAGE_ALIGN(zdev->start_dma);
+	aperture_size = min3(s390_iommu_aperture,
+			     ZPCI_TABLE_SIZE_RT - zdev->start_dma,
+			     zdev->end_dma - zdev->start_dma + 1);
+	zdev->end_dma = zdev->start_dma + aperture_size - 1;
+
 	return 0;
 
 out_sysfs:
@@ -429,10 +733,49 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 	iommu_device_sysfs_remove(&zdev->iommu_dev);
 }
 
+static int __init s390_iommu_setup(char *str)
+{
+	if (!strcmp(str, "strict")) {
+		pr_warn("s390_iommu=strict deprecated; use iommu.strict=1 instead\n");
+		iommu_set_dma_strict();
+	}
+	return 1;
+}
+
+__setup("s390_iommu=", s390_iommu_setup);
+
+static int __init s390_iommu_aperture_setup(char *str)
+{
+	if (kstrtou32(str, 10, &s390_iommu_aperture_factor))
+		s390_iommu_aperture_factor = 1;
+	return 1;
+}
+
+__setup("s390_iommu_aperture=", s390_iommu_aperture_setup);
+
+static int __init s390_iommu_init(void)
+{
+	int rc;
+
+	s390_iommu_aperture = (u64)virt_to_phys(high_memory);
+	if (!s390_iommu_aperture_factor)
+		s390_iommu_aperture = ULONG_MAX;
+	else
+		s390_iommu_aperture *= s390_iommu_aperture_factor;
+
+	rc = dma_alloc_cpu_table_caches();
+	if (rc)
+		return rc;
+
+	return rc;
+}
+subsys_initcall(s390_iommu_init);
+
 static const struct iommu_ops s390_iommu_ops = {
 	.capable = s390_iommu_capable,
 	.domain_alloc = s390_domain_alloc,
 	.probe_device = s390_iommu_probe_device,
+	.probe_finalize = s390_iommu_probe_finalize,
 	.release_device = s390_iommu_release_device,
 	.device_group = generic_device_group,
 	.pgsize_bitmap = SZ_4K,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
                   ` (2 preceding siblings ...)
  2022-11-16 17:16 ` [PATCH v2 3/7] s390/pci: Use dma-iommu layer Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-17  1:55   ` Baolu Lu
  2022-11-16 17:16 ` [PATCH v2 5/7] iommu/dma: Allow a single FQ in addition to per-CPU FQs Niklas Schnelle
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

When iommu.strict=1 is set or iommu_set_dma_strict() was called we
should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 65a3b3d886dc..d9bf94d198df 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);
 
+	if (iommu_dma_strict)
+		return IOMMU_DOMAIN_DMA;
+
 	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
 		return IOMMU_DOMAIN_DMA;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 5/7] iommu/dma: Allow a single FQ in addition to per-CPU FQs
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
                   ` (3 preceding siblings ...)
  2022-11-16 17:16 ` [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 6/7] iommu/dma: Enable variable queue size and use larger single queue Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication Niklas Schnelle
  6 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

In some virtualized environments, including s390 paged memory guests,
IOTLB flushes are used to update IOMMU shadow tables. Due to this, they
are much more expensive than in typical bare metal environments or
non-paged s390 guests. In addition, they may parallelize more poorly in
virtualized environments. This changes the trade-off for flushing IOVAs
such that minimizing the number of IOTLB flushes trumps any benefit of
cheaper queuing operations or increased parallelism.

In this scenario per-CPU flush queues pose several problems. Firstly,
per-CPU memory is often quite limited, prohibiting larger queues.
Secondly, collecting IOVAs per-CPU but flushing via a global timeout
reduces the number of IOVAs flushed per timeout, especially on s390
where PCI interrupts are not bound to a specific CPU.

Thus let's introduce a single flush queue mode IOMMU_DOMAIN_DMA_SQ that
reuses the same queue logic but only allocates a single global queue
allowing larger batches of IOVAs to be freed at once and with larger
timeouts. This is based on an idea by Robin Murphy to allow the common
IOVA flushing code to more closely resemble the global flush behavior
used on s390's previous internal DMA API implementation.
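
The queueing code then simply picks the per-CPU or the global queue based
on the domain type. Roughly, as a simplified sketch of the logic added
below (locking and timer handling omitted, the helper name is made up for
illustration):

	/* sketch: select the flush queue for this domain type */
	static struct iova_fq *fq_get(struct iommu_dma_cookie *cookie)
	{
		if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
			return raw_cpu_ptr(cookie->percpu_fq);
		return cookie->single_fq;	/* IOMMU_DOMAIN_DMA_SQ */
	}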

Link: https://lore.kernel.org/linux-iommu/3e402947-61f9-b7e8-1414-fde006257b6f@arm.com/
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/dma-iommu.c  | 146 ++++++++++++++++++++++++++++---------
 drivers/iommu/iommu.c      |  15 +++-
 drivers/iommu/s390-iommu.c |  11 +++
 include/linux/iommu.h      |   7 ++
 4 files changed, 140 insertions(+), 39 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9297b741f5e8..1cdbf8579946 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -48,8 +48,11 @@ struct iommu_dma_cookie {
 		/* Full allocator for IOMMU_DMA_IOVA_COOKIE */
 		struct {
 			struct iova_domain	iovad;
-
-			struct iova_fq __percpu *fq;	/* Flush queue */
+			/* Flush queue */
+			union {
+				struct iova_fq	*single_fq;
+				struct iova_fq	__percpu *percpu_fq;
+			};
 			/* Number of TLB flushes that have been started */
 			atomic64_t		fq_flush_start_cnt;
 			/* Number of TLB flushes that have been finished */
@@ -151,25 +154,44 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
 	atomic64_inc(&cookie->fq_flush_finish_cnt);
 }
 
-static void fq_flush_timeout(struct timer_list *t)
+static void fq_flush_percpu(struct iommu_dma_cookie *cookie)
 {
-	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
 	int cpu;
 
-	atomic_set(&cookie->fq_timer_on, 0);
-	fq_flush_iotlb(cookie);
-
 	for_each_possible_cpu(cpu) {
 		unsigned long flags;
 		struct iova_fq *fq;
 
-		fq = per_cpu_ptr(cookie->fq, cpu);
+		fq = per_cpu_ptr(cookie->percpu_fq, cpu);
 		spin_lock_irqsave(&fq->lock, flags);
 		fq_ring_free(cookie, fq);
 		spin_unlock_irqrestore(&fq->lock, flags);
 	}
 }
 
+static void fq_flush_single(struct iommu_dma_cookie *cookie)
+{
+	struct iova_fq *fq = cookie->single_fq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->lock, flags);
+	fq_ring_free(cookie, fq);
+	spin_unlock_irqrestore(&fq->lock, flags);
+}
+
+static void fq_flush_timeout(struct timer_list *t)
+{
+	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
+
+	atomic_set(&cookie->fq_timer_on, 0);
+	fq_flush_iotlb(cookie);
+
+	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
+		fq_flush_percpu(cookie);
+	else
+		fq_flush_single(cookie);
+}
+
 static void queue_iova(struct iommu_dma_cookie *cookie,
 		unsigned long pfn, unsigned long pages,
 		struct list_head *freelist)
@@ -187,7 +209,11 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
 	 */
 	smp_mb();
 
-	fq = raw_cpu_ptr(cookie->fq);
+	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
+		fq = raw_cpu_ptr(cookie->percpu_fq);
+	else
+		fq = cookie->single_fq;
+
 	spin_lock_irqsave(&fq->lock, flags);
 
 	/*
@@ -218,31 +244,91 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
 			  jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
 }
 
-static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
+static void iommu_dma_free_fq_single(struct iova_fq *fq)
 {
-	int cpu, idx;
+	int idx;
 
-	if (!cookie->fq)
+	if (!fq)
 		return;
+	fq_ring_for_each(idx, fq)
+		put_pages_list(&fq->entries[idx].freelist);
+	vfree(fq);
+}
+
+static void iommu_dma_free_fq_percpu(struct iova_fq __percpu *percpu_fq)
+{
+	int cpu, idx;
 
-	del_timer_sync(&cookie->fq_timer);
 	/* The IOVAs will be torn down separately, so just free our queued pages */
 	for_each_possible_cpu(cpu) {
-		struct iova_fq *fq = per_cpu_ptr(cookie->fq, cpu);
+		struct iova_fq *fq = per_cpu_ptr(percpu_fq, cpu);
 
 		fq_ring_for_each(idx, fq)
 			put_pages_list(&fq->entries[idx].freelist);
 	}
 
-	free_percpu(cookie->fq);
+	free_percpu(percpu_fq);
+}
+
+static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
+{
+	if (!cookie->fq_domain)
+		return;
+
+	del_timer_sync(&cookie->fq_timer);
+	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
+		iommu_dma_free_fq_percpu(cookie->percpu_fq);
+	else
+		iommu_dma_free_fq_single(cookie->single_fq);
+}
+
+
+static void iommu_dma_init_one_fq(struct iova_fq *fq)
+{
+	int i;
+
+	fq->head = 0;
+	fq->tail = 0;
+
+	spin_lock_init(&fq->lock);
+
+	for (i = 0; i < IOVA_FQ_SIZE; i++)
+		INIT_LIST_HEAD(&fq->entries[i].freelist);
+}
+
+static int iommu_dma_init_fq_single(struct iommu_dma_cookie *cookie)
+{
+	struct iova_fq *queue;
+
+	queue = vzalloc(sizeof(*queue));
+	if (!queue)
+		return -ENOMEM;
+	iommu_dma_init_one_fq(queue);
+	cookie->single_fq = queue;
+
+	return 0;
+}
+
+static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie)
+{
+	struct iova_fq __percpu *queue;
+	int cpu;
+
+	queue = alloc_percpu(struct iova_fq);
+	if (!queue)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		iommu_dma_init_one_fq(per_cpu_ptr(queue, cpu));
+	cookie->percpu_fq = queue;
+	return 0;
 }
 
 /* sysfs updates are serialised by the mutex of the group owning @domain */
 int iommu_dma_init_fq(struct iommu_domain *domain)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
-	struct iova_fq __percpu *queue;
-	int i, cpu;
+	int rc;
 
 	if (cookie->fq_domain)
 		return 0;
@@ -250,26 +336,16 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
 	atomic64_set(&cookie->fq_flush_start_cnt,  0);
 	atomic64_set(&cookie->fq_flush_finish_cnt, 0);
 
-	queue = alloc_percpu(struct iova_fq);
-	if (!queue) {
-		pr_warn("iova flush queue initialization failed\n");
-		return -ENOMEM;
-	}
-
-	for_each_possible_cpu(cpu) {
-		struct iova_fq *fq = per_cpu_ptr(queue, cpu);
-
-		fq->head = 0;
-		fq->tail = 0;
-
-		spin_lock_init(&fq->lock);
+	if (domain->type == IOMMU_DOMAIN_DMA_FQ)
+		rc = iommu_dma_init_fq_percpu(cookie);
+	else
+		rc = iommu_dma_init_fq_single(cookie);
 
-		for (i = 0; i < IOVA_FQ_SIZE; i++)
-			INIT_LIST_HEAD(&fq->entries[i].freelist);
+	if (rc) {
+		pr_warn("iova flush queue initialization failed\n");
+		return rc;
 	}
 
-	cookie->fq = queue;
-
 	timer_setup(&cookie->fq_timer, fq_flush_timeout, 0);
 	atomic_set(&cookie->fq_timer_on, 0);
 	/*
@@ -583,7 +659,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
 		goto done_unlock;
 
 	/* If the FQ fails we can simply fall back to strict mode */
-	if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain))
+	if (!!(domain->type & __IOMMU_DOMAIN_DMA_FQ) && iommu_dma_init_fq(domain))
 		domain->type = IOMMU_DOMAIN_DMA;
 
 	ret = iova_reserve_iommu_regions(dev, domain);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d9bf94d198df..8d57c4a4c394 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -140,6 +140,7 @@ static const char *iommu_domain_type_str(unsigned int t)
 		return "Unmanaged";
 	case IOMMU_DOMAIN_DMA:
 	case IOMMU_DOMAIN_DMA_FQ:
+	case IOMMU_DOMAIN_DMA_SQ:
 		return "Translated";
 	default:
 		return "Unknown";
@@ -437,7 +438,7 @@ early_param("iommu.strict", iommu_dma_setup);
 void iommu_set_dma_strict(void)
 {
 	iommu_dma_strict = true;
-	if (iommu_def_domain_type == IOMMU_DOMAIN_DMA_FQ)
+	if (iommu_def_domain_type & __IOMMU_DOMAIN_DMA_FQ)
 		iommu_def_domain_type = IOMMU_DOMAIN_DMA;
 }
 
@@ -638,6 +639,9 @@ static ssize_t iommu_group_show_type(struct iommu_group *group,
 		case IOMMU_DOMAIN_DMA_FQ:
 			type = "DMA-FQ\n";
 			break;
+		case IOMMU_DOMAIN_DMA_SQ:
+			type = "DMA-SQ\n";
+			break;
 		}
 	}
 	mutex_unlock(&group->mutex);
@@ -2912,10 +2916,10 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 	}
 
 	/* We can bring up a flush queue without tearing down the domain */
-	if (type == IOMMU_DOMAIN_DMA_FQ && prev_dom->type == IOMMU_DOMAIN_DMA) {
+	if (!!(type & __IOMMU_DOMAIN_DMA_FQ) && prev_dom->type == IOMMU_DOMAIN_DMA) {
 		ret = iommu_dma_init_fq(prev_dom);
 		if (!ret)
-			prev_dom->type = IOMMU_DOMAIN_DMA_FQ;
+			prev_dom->type = type;
 		goto out;
 	}
 
@@ -2986,6 +2990,8 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 		req_type = IOMMU_DOMAIN_DMA;
 	else if (sysfs_streq(buf, "DMA-FQ"))
 		req_type = IOMMU_DOMAIN_DMA_FQ;
+	else if (sysfs_streq(buf, "DMA-SQ"))
+		req_type = IOMMU_DOMAIN_DMA_SQ;
 	else if (sysfs_streq(buf, "auto"))
 		req_type = 0;
 	else
@@ -3037,7 +3043,8 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 
 	/* Check if the device in the group still has a driver bound to it */
 	device_lock(dev);
-	if (device_is_bound(dev) && !(req_type == IOMMU_DOMAIN_DMA_FQ &&
+	if (device_is_bound(dev) && !((req_type == IOMMU_DOMAIN_DMA_FQ ||
+	    req_type == IOMMU_DOMAIN_DMA_SQ) &&
 	    group->default_domain->type == IOMMU_DOMAIN_DMA)) {
 		pr_err_ratelimited("Device is still bound to driver\n");
 		ret = -EBUSY;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 9fd788d64ac8..087bb2acff30 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -325,6 +325,15 @@ static bool s390_iommu_capable(struct device *dev, enum iommu_cap cap)
 	}
 }
 
+static int s390_iommu_def_domain_type(struct device *dev)
+{
+	struct zpci_dev *zdev = to_zpci_dev(dev);
+
+	if (zdev->tlb_refresh)
+		return IOMMU_DOMAIN_DMA_SQ;
+	return IOMMU_DOMAIN_DMA_FQ;
+}
+
 static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 {
 	struct s390_domain *s390_domain;
@@ -332,6 +341,7 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
 	switch (domain_type) {
 	case IOMMU_DOMAIN_DMA:
 	case IOMMU_DOMAIN_DMA_FQ:
+	case IOMMU_DOMAIN_DMA_SQ:
 	case IOMMU_DOMAIN_UNMANAGED:
 		break;
 	default:
@@ -773,6 +783,7 @@ subsys_initcall(s390_iommu_init);
 
 static const struct iommu_ops s390_iommu_ops = {
 	.capable = s390_iommu_capable,
+	.def_domain_type = s390_iommu_def_domain_type,
 	.domain_alloc = s390_domain_alloc,
 	.probe_device = s390_iommu_probe_device,
 	.probe_finalize = s390_iommu_probe_finalize,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3c9da1f8979e..be4f461df5ab 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -63,6 +63,7 @@ struct iommu_domain_geometry {
 					      implementation              */
 #define __IOMMU_DOMAIN_PT	(1U << 2)  /* Domain is identity mapped   */
 #define __IOMMU_DOMAIN_DMA_FQ	(1U << 3)  /* DMA-API uses flush queue    */
+#define __IOMMU_DOMAIN_DMA_SQ	(1U << 4)  /* DMA-API uses single queue   */
 
 /*
  * This are the possible domain-types
@@ -77,6 +78,8 @@ struct iommu_domain_geometry {
  *				  certain optimizations for these domains
  *	IOMMU_DOMAIN_DMA_FQ	- As above, but definitely using batched TLB
  *				  invalidation.
+ *	IOMMU_DOMAIN_DMA_SQ	- As above, but batched TLB invalidations use
+ *				  a single global queue
  */
 #define IOMMU_DOMAIN_BLOCKED	(0U)
 #define IOMMU_DOMAIN_IDENTITY	(__IOMMU_DOMAIN_PT)
@@ -86,6 +89,10 @@ struct iommu_domain_geometry {
 #define IOMMU_DOMAIN_DMA_FQ	(__IOMMU_DOMAIN_PAGING |	\
 				 __IOMMU_DOMAIN_DMA_API |	\
 				 __IOMMU_DOMAIN_DMA_FQ)
+#define IOMMU_DOMAIN_DMA_SQ	(__IOMMU_DOMAIN_PAGING |	\
+				 __IOMMU_DOMAIN_DMA_API |	\
+				 __IOMMU_DOMAIN_DMA_FQ |	\
+				 __IOMMU_DOMAIN_DMA_SQ)
 
 struct iommu_domain {
 	unsigned type;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 6/7] iommu/dma: Enable variable queue size and use larger single queue
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
                   ` (4 preceding siblings ...)
  2022-11-16 17:16 ` [PATCH v2 5/7] iommu/dma: Allow a single FQ in addition to per-CPU FQs Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-16 17:16 ` [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication Niklas Schnelle
  6 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

Flush queues currently use a fixed compile-time size of 256 entries.
This being a power of 2 allows the compiler to use shifts and masks
instead of more expensive modulo operations. With per-CPU flush queues
larger queue sizes would hit per-CPU allocation limits; with a single
flush queue, however, these limits do not apply. As single flush queue
mode is intended for environments with expensive IOTLB flushes, it then
makes sense to use a larger queue size and timeout.

To this end, reorder struct iova_fq so we can use a flexible array
member and make the flush queue size and timeout variable. So as not to
lose this optimization, check that the variable length is a power of 2
and use an explicit mask instead of letting the compiler optimize this.
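
As a minimal illustration (not an actual hunk from this patch), with
a power-of-2 queue size the ring index arithmetic becomes a simple mask
instead of a modulo:

	/* fq_size must be a power of 2, e.g. 256 or 32768 */
	fq->mod_mask = fq_size - 1;

	/* equivalent to (fq->tail + 1) % fq_size, but cheaper */
	fq->tail = (fq->tail + 1) & fq->mod_mask;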

For now use a large fixed queue size and timeout for single flush queues,
which brings their performance on s390 paged memory guests on par with
the previous s390-specific DMA API implementation. In the future the
flush queue size can then be turned into a config option or kernel
parameter.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/dma-iommu.c | 60 ++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 1cdbf8579946..3801cdf11aa8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -61,6 +61,8 @@ struct iommu_dma_cookie {
 			struct timer_list	fq_timer;
 			/* 1 when timer is active, 0 when not */
 			atomic_t		fq_timer_on;
+			/* timeout in ms */
+			unsigned long fq_timer_timeout;
 		};
 		/* Trivial linear page allocator for IOMMU_DMA_MSI_COOKIE */
 		dma_addr_t		msi_iova;
@@ -86,10 +88,16 @@ static int __init iommu_dma_forcedac_setup(char *str)
 early_param("iommu.forcedac", iommu_dma_forcedac_setup);
 
 /* Number of entries per flush queue */
-#define IOVA_FQ_SIZE	256
+#define IOVA_DEFAULT_FQ_SIZE	256
+
+/* Number of entries for a single queue */
+#define IOVA_SINGLE_FQ_SIZE	32768
 
 /* Timeout (in ms) after which entries are flushed from the queue */
-#define IOVA_FQ_TIMEOUT	10
+#define IOVA_DEFAULT_FQ_TIMEOUT	10
+
+/* Timeout (in ms) for a single queue */
+#define IOVA_SINGLE_FQ_TIMEOUT	1000
 
 /* Flush queue entry for deferred flushing */
 struct iova_fq_entry {
@@ -101,18 +109,19 @@ struct iova_fq_entry {
 
 /* Per-CPU flush queue structure */
 struct iova_fq {
-	struct iova_fq_entry entries[IOVA_FQ_SIZE];
-	unsigned int head, tail;
 	spinlock_t lock;
+	unsigned int head, tail;
+	unsigned int mod_mask;
+	struct iova_fq_entry entries[];
 };
 
 #define fq_ring_for_each(i, fq) \
-	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE)
+	for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) & (fq)->mod_mask)
 
 static inline bool fq_full(struct iova_fq *fq)
 {
 	assert_spin_locked(&fq->lock);
-	return (((fq->tail + 1) % IOVA_FQ_SIZE) == fq->head);
+	return (((fq->tail + 1) & fq->mod_mask) == fq->head);
 }
 
 static inline unsigned int fq_ring_add(struct iova_fq *fq)
@@ -121,7 +130,7 @@ static inline unsigned int fq_ring_add(struct iova_fq *fq)
 
 	assert_spin_locked(&fq->lock);
 
-	fq->tail = (idx + 1) % IOVA_FQ_SIZE;
+	fq->tail = (idx + 1) & fq->mod_mask;
 
 	return idx;
 }
@@ -143,7 +152,7 @@ static void fq_ring_free(struct iommu_dma_cookie *cookie, struct iova_fq *fq)
 			       fq->entries[idx].iova_pfn,
 			       fq->entries[idx].pages);
 
-		fq->head = (fq->head + 1) % IOVA_FQ_SIZE;
+		fq->head = (fq->head + 1) & fq->mod_mask;
 	}
 }
 
@@ -241,7 +250,7 @@ static void queue_iova(struct iommu_dma_cookie *cookie,
 	if (!atomic_read(&cookie->fq_timer_on) &&
 	    !atomic_xchg(&cookie->fq_timer_on, 1))
 		mod_timer(&cookie->fq_timer,
-			  jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
+			  jiffies + msecs_to_jiffies(cookie->fq_timer_timeout));
 }
 
 static void iommu_dma_free_fq_single(struct iova_fq *fq)
@@ -283,43 +292,45 @@ static void iommu_dma_free_fq(struct iommu_dma_cookie *cookie)
 }
 
 
-static void iommu_dma_init_one_fq(struct iova_fq *fq)
+static void iommu_dma_init_one_fq(struct iova_fq *fq, unsigned int fq_size)
 {
 	int i;
 
 	fq->head = 0;
 	fq->tail = 0;
+	fq->mod_mask = fq_size - 1;
 
 	spin_lock_init(&fq->lock);
 
-	for (i = 0; i < IOVA_FQ_SIZE; i++)
+	for (i = 0; i < fq_size; i++)
 		INIT_LIST_HEAD(&fq->entries[i].freelist);
 }
 
-static int iommu_dma_init_fq_single(struct iommu_dma_cookie *cookie)
+static int iommu_dma_init_fq_single(struct iommu_dma_cookie *cookie, unsigned int fq_size)
 {
 	struct iova_fq *queue;
 
-	queue = vzalloc(sizeof(*queue));
+	queue = vzalloc(struct_size(queue, entries, fq_size));
 	if (!queue)
 		return -ENOMEM;
-	iommu_dma_init_one_fq(queue);
+	iommu_dma_init_one_fq(queue, fq_size);
 	cookie->single_fq = queue;
 
 	return 0;
 }
 
-static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie)
+static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie, unsigned int fq_size)
 {
 	struct iova_fq __percpu *queue;
 	int cpu;
 
-	queue = alloc_percpu(struct iova_fq);
+	queue = __alloc_percpu(struct_size(queue, entries, fq_size),
+			       __alignof__(*queue));
 	if (!queue)
 		return -ENOMEM;
 
 	for_each_possible_cpu(cpu)
-		iommu_dma_init_one_fq(per_cpu_ptr(queue, cpu));
+		iommu_dma_init_one_fq(per_cpu_ptr(queue, cpu), fq_size);
 	cookie->percpu_fq = queue;
 	return 0;
 }
@@ -328,24 +339,35 @@ static int iommu_dma_init_fq_percpu(struct iommu_dma_cookie *cookie)
 int iommu_dma_init_fq(struct iommu_domain *domain)
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	unsigned int fq_size = IOVA_DEFAULT_FQ_SIZE;
 	int rc;
 
 	if (cookie->fq_domain)
 		return 0;
 
+	if (domain->type == IOMMU_DOMAIN_DMA_SQ)
+		fq_size = IOVA_SINGLE_FQ_SIZE;
+
+	if (!is_power_of_2(fq_size)) {
+		pr_err("FQ size must be a power of 2\n");
+		return -EINVAL;
+	}
+
 	atomic64_set(&cookie->fq_flush_start_cnt,  0);
 	atomic64_set(&cookie->fq_flush_finish_cnt, 0);
 
 	if (domain->type == IOMMU_DOMAIN_DMA_FQ)
-		rc = iommu_dma_init_fq_percpu(cookie);
+		rc = iommu_dma_init_fq_percpu(cookie, fq_size);
 	else
-		rc = iommu_dma_init_fq_single(cookie);
+		rc = iommu_dma_init_fq_single(cookie, fq_size);
 
 	if (rc) {
 		pr_warn("iova flush queue initialization failed\n");
 		return rc;
 	}
 
+	cookie->fq_timer_timeout = (domain->type == IOMMU_DOMAIN_DMA_SQ) ?
+			IOVA_SINGLE_FQ_TIMEOUT : IOVA_DEFAULT_FQ_TIMEOUT;
 	timer_setup(&cookie->fq_timer, fq_flush_timeout, 0);
 	atomic_set(&cookie->fq_timer_on, 0);
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
                   ` (5 preceding siblings ...)
  2022-11-16 17:16 ` [PATCH v2 6/7] iommu/dma: Enable variable queue size and use larger single queue Niklas Schnelle
@ 2022-11-16 17:16 ` Niklas Schnelle
  2022-11-28 14:52   ` Robin Murphy
  6 siblings, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-16 17:16 UTC (permalink / raw)
  To: Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

When RPCIT indicates that the underlying hypervisor has run out of
resources it often means that its IOVA space is exhausted and IOVAs need
to be freed before new ones can be created. By triggering a flush of the
IOVA queue we can get the queued IOVAs freed and also get the new
mapping established during the global flush.

Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/iommu/dma-iommu.c  | 14 +++++++++-----
 drivers/iommu/dma-iommu.h  |  1 +
 drivers/iommu/s390-iommu.c |  7 +++++--
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 3801cdf11aa8..54e7f63fd0d9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie)
 	spin_unlock_irqrestore(&fq->lock, flags);
 }
 
-static void fq_flush_timeout(struct timer_list *t)
+void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie)
 {
-	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
-
-	atomic_set(&cookie->fq_timer_on, 0);
 	fq_flush_iotlb(cookie);
-
 	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
 		fq_flush_percpu(cookie);
 	else
 		fq_flush_single(cookie);
 }
 
+static void fq_flush_timeout(struct timer_list *t)
+{
+	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
+
+	atomic_set(&cookie->fq_timer_on, 0);
+	iommu_dma_flush_fq(cookie);
+}
+
 static void queue_iova(struct iommu_dma_cookie *cookie,
 		unsigned long pfn, unsigned long pages,
 		struct list_head *freelist)
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index 942790009292..cac06030aa26 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
 void iommu_put_dma_cookie(struct iommu_domain *domain);
 
 int iommu_dma_init_fq(struct iommu_domain *domain);
+void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie);
 
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
 
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 087bb2acff30..9c2782c4043e 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
 {
 	struct s390_domain *s390_domain = to_s390_domain(domain);
 	struct zpci_dev *zdev;
+	int rc;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
 		if (!zdev->tlb_refresh)
 			continue;
 		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
-		zpci_refresh_trans((u64)zdev->fh << 32,
-				   iova, size);
+		rc = zpci_refresh_trans((u64)zdev->fh << 32,
+					iova, size);
+		if (rc == -ENOMEM)
+			iommu_dma_flush_fq(domain->iova_cookie);
 	}
 	rcu_read_unlock();
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-16 17:16 ` [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type Niklas Schnelle
@ 2022-11-17  1:55   ` Baolu Lu
  2022-11-28 11:10     ` Niklas Schnelle
  0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2022-11-17  1:55 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: baolu.lu, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel, Julian Ruess

On 2022/11/17 1:16, Niklas Schnelle wrote:
> When iommu.strict=1 is set or iommu_set_dma_strict() was called we
> should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
> 
> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> ---
>   drivers/iommu/iommu.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 65a3b3d886dc..d9bf94d198df 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
>   {
>   	const struct iommu_ops *ops = dev_iommu_ops(dev);
>   
> +	if (iommu_dma_strict)
> +		return IOMMU_DOMAIN_DMA;

If any quirky device must work in IOMMU identity mapping mode, this
might introduce functional regression. At least for VT-d platforms, some
devices do require IOMMU identity mapping mode for functionality.

> +
>   	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
>   		return IOMMU_DOMAIN_DMA;
>   

Best regards,
baolu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-17  1:55   ` Baolu Lu
@ 2022-11-28 11:10     ` Niklas Schnelle
  2022-11-28 13:00       ` Baolu Lu
  2022-11-28 13:29       ` Jason Gunthorpe
  0 siblings, 2 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-28 11:10 UTC (permalink / raw)
  To: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Thu, 2022-11-17 at 09:55 +0800, Baolu Lu wrote:
> On 2022/11/17 1:16, Niklas Schnelle wrote:
> > When iommu.strict=1 is set or iommu_set_dma_strict() was called we
> > should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
> > 
> > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > ---
> >   drivers/iommu/iommu.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 65a3b3d886dc..d9bf94d198df 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
> >   {
> >   	const struct iommu_ops *ops = dev_iommu_ops(dev);
> >   
> > +	if (iommu_dma_strict)
> > +		return IOMMU_DOMAIN_DMA;
> 
> If any quirky device must work in IOMMU identity mapping mode, this
> might introduce functional regression. At least for VT-d platforms, some
> devices do require IOMMU identity mapping mode for functionality.

That's a good point. How about instead of unconditionally returning
IOMMU_DOMAIN_DMA we just do so if the domain type returned by ops-
>def_domain_type uses a flush queue (i.e. the __IOMMU_DOMAIN_DMA_FQ bit
is set). That way a device that only supports identity mapping gets to
set that but iommu_dma_strict at least always prevents use of an IOVA
flush queue.
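
Roughly, as an untested sketch of that idea (assuming ops->def_domain_type
is set):

	type = ops->def_domain_type(dev);
	/* strict mode overrides flush-queue defaults, not identity quirks */
	if (iommu_dma_strict && (type & __IOMMU_DOMAIN_DMA_FQ))
		return IOMMU_DOMAIN_DMA;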

> 
> > +
> >   	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
> >   		return IOMMU_DOMAIN_DMA;
> >   
> 
> Best regards,
> baolu



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 11:10     ` Niklas Schnelle
@ 2022-11-28 13:00       ` Baolu Lu
  2022-11-28 13:29       ` Jason Gunthorpe
  1 sibling, 0 replies; 39+ messages in thread
From: Baolu Lu @ 2022-11-28 13:00 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Jason Gunthorpe, Wenjia Zhang
  Cc: baolu.lu, Pierre Morel, linux-s390, borntraeger, hca, gor,
	gerald.schaefer, agordeev, svens, linux-kernel, Julian Ruess

On 2022/11/28 19:10, Niklas Schnelle wrote:
> On Thu, 2022-11-17 at 09:55 +0800, Baolu Lu wrote:
>> On 2022/11/17 1:16, Niklas Schnelle wrote:
>>> When iommu.strict=1 is set or iommu_set_dma_strict() was called we
>>> should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
>>>
>>> Signed-off-by: Niklas Schnelle<schnelle@linux.ibm.com>
>>> ---
>>>    drivers/iommu/iommu.c | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>> index 65a3b3d886dc..d9bf94d198df 100644
>>> --- a/drivers/iommu/iommu.c
>>> +++ b/drivers/iommu/iommu.c
>>> @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
>>>    {
>>>    	const struct iommu_ops *ops = dev_iommu_ops(dev);
>>>    
>>> +	if (iommu_dma_strict)
>>> +		return IOMMU_DOMAIN_DMA;
>> If any quirky device must work in IOMMU identity mapping mode, this
>> might introduce functional regression. At least for VT-d platforms, some
>> devices do require IOMMU identity mapping mode for functionality.
> That's a good point. How about instead of unconditionally returning
> IOMMU_DOMAIN_DMA we just do so if the domain type returned by ops-
>> def_domain_type uses a flush queue (i.e. the __IOMMU_DOMAIN_DMA_FQ bit
> is set). That way a device that only supports identity mapping gets to
> set that but iommu_dma_strict at least always prevents use of an IOVA
> flush queue.

def_domain_type returns IOMMU_DOMAIN_DMA, IOMMU_DOMAIN_IDENTITY or 0
(don't care). From a code perspective, you can force IOMMU_DOMAIN_DMA if
def_domain_type() returns 0.

*But* you need to document the relationship between strict mode and
default domain type somewhere and get that agreed with Robin.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 11:10     ` Niklas Schnelle
  2022-11-28 13:00       ` Baolu Lu
@ 2022-11-28 13:29       ` Jason Gunthorpe
  2022-11-28 15:54         ` Niklas Schnelle
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-11-28 13:29 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Mon, Nov 28, 2022 at 12:10:39PM +0100, Niklas Schnelle wrote:
> On Thu, 2022-11-17 at 09:55 +0800, Baolu Lu wrote:
> > On 2022/11/17 1:16, Niklas Schnelle wrote:
> > > When iommu.strict=1 is set or iommu_set_dma_strict() was called we
> > > should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
> > > 
> > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > ---
> > >   drivers/iommu/iommu.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 65a3b3d886dc..d9bf94d198df 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
> > >   {
> > >   	const struct iommu_ops *ops = dev_iommu_ops(dev);
> > >   
> > > +	if (iommu_dma_strict)
> > > +		return IOMMU_DOMAIN_DMA;
> > 
> > If any quirky device must work in IOMMU identity mapping mode, this
> > might introduce functional regression. At least for VT-d platforms, some
> > devices do require IOMMU identity mapping mode for functionality.
> 
> That's a good point. How about instead of unconditionally returning
> IOMMU_DOMAIN_DMA we just do so if the domain type returned by ops-
> >def_domain_type uses a flush queue (i.e. the __IOMMU_DOMAIN_DMA_FQ bit
> is set). That way a device that only supports identity mapping gets to
> set that but iommu_dma_strict at least always prevents use of an IOVA
> flush queue.

I would prefer we create some formal caps in iommu_ops to describe
whatever it is you are trying to do.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-16 17:16 ` [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication Niklas Schnelle
@ 2022-11-28 14:52   ` Robin Murphy
  2022-11-29 12:00     ` Niklas Schnelle
  0 siblings, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-11-28 14:52 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On 2022-11-16 17:16, Niklas Schnelle wrote:
> When RPCIT indicates that the underlying hypervisor has run out of
> resources it often means that its IOVA space is exhausted and IOVAs need
> to be freed before new ones can be created. By triggering a flush of the
> IOVA queue we can get the queued IOVAs freed and also get the new
> mapping established during the global flush.

Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is 
exhausted and fail the DMA API call before even getting as far as 
iommu_map(), though? Or is there some less obvious limitation like a 
maximum total number of distinct IOVA regions regardless of size?

Other than the firmware reserved region helpers which are necessarily a 
bit pick-and-mix, I've been trying to remove all the iommu-dma details 
from drivers, so I'd really like to maintain that separation if at all 
possible.

> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> ---
>   drivers/iommu/dma-iommu.c  | 14 +++++++++-----
>   drivers/iommu/dma-iommu.h  |  1 +
>   drivers/iommu/s390-iommu.c |  7 +++++--
>   3 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 3801cdf11aa8..54e7f63fd0d9 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie)
>   	spin_unlock_irqrestore(&fq->lock, flags);
>   }
>   
> -static void fq_flush_timeout(struct timer_list *t)
> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie)
>   {
> -	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> -
> -	atomic_set(&cookie->fq_timer_on, 0);
>   	fq_flush_iotlb(cookie);
> -
>   	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
>   		fq_flush_percpu(cookie);
>   	else
>   		fq_flush_single(cookie);
>   }
>   
> +static void fq_flush_timeout(struct timer_list *t)
> +{
> +	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> +
> +	atomic_set(&cookie->fq_timer_on, 0);
> +	iommu_dma_flush_fq(cookie);
> +}
> +
>   static void queue_iova(struct iommu_dma_cookie *cookie,
>   		unsigned long pfn, unsigned long pages,
>   		struct list_head *freelist)
> diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
> index 942790009292..cac06030aa26 100644
> --- a/drivers/iommu/dma-iommu.h
> +++ b/drivers/iommu/dma-iommu.h
> @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
>   void iommu_put_dma_cookie(struct iommu_domain *domain);
>   
>   int iommu_dma_init_fq(struct iommu_domain *domain);
> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie);
>   
>   void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
>   
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 087bb2acff30..9c2782c4043e 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
>   {
>   	struct s390_domain *s390_domain = to_s390_domain(domain);
>   	struct zpci_dev *zdev;
> +	int rc;
>   
>   	rcu_read_lock();
>   	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
>   		if (!zdev->tlb_refresh)
>   			continue;
>   		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
> -		zpci_refresh_trans((u64)zdev->fh << 32,
> -				   iova, size);
> +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> +					iova, size);
> +		if (rc == -ENOMEM)
> +			iommu_dma_flush_fq(domain->iova_cookie);

Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or 
IOMMU_DOMAIN_UNMANAGED domain?

However I can't figure out how this is supposed to work anyway - 
.sync_map only gets called if .map claimed that the actual mapping(s) 
succeeded, it can't fail itself, and even if it does free up some IOVAs 
at this point by draining the flush queue, I don't see how the mapping 
then gets retried, or what happens if it still fails after that :/

Thanks,
Robin.

>   	}
>   	rcu_read_unlock();
>   }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 13:29       ` Jason Gunthorpe
@ 2022-11-28 15:54         ` Niklas Schnelle
  2022-11-28 16:35           ` Jason Gunthorpe
  2022-11-28 16:56           ` Robin Murphy
  0 siblings, 2 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-28 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Mon, 2022-11-28 at 09:29 -0400, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2022 at 12:10:39PM +0100, Niklas Schnelle wrote:
> > On Thu, 2022-11-17 at 09:55 +0800, Baolu Lu wrote:
> > > On 2022/11/17 1:16, Niklas Schnelle wrote:
> > > > When iommu.strict=1 is set or iommu_set_dma_strict() was called we
> > > > should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
> > > > 
> > > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > > ---
> > > >   drivers/iommu/iommu.c | 3 +++
> > > >   1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > index 65a3b3d886dc..d9bf94d198df 100644
> > > > --- a/drivers/iommu/iommu.c
> > > > +++ b/drivers/iommu/iommu.c
> > > > @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
> > > >   {
> > > >   	const struct iommu_ops *ops = dev_iommu_ops(dev);
> > > >   
> > > > +	if (iommu_dma_strict)
> > > > +		return IOMMU_DOMAIN_DMA;
> > > 
> > > If any quirky device must work in IOMMU identity mapping mode, this
> > > might introduce functional regression. At least for VT-d platforms, some
> > > devices do require IOMMU identity mapping mode for functionality.
> > 
> > That's a good point. How about instead of unconditionally returning
> > IOMMU_DOMAIN_DMA we just do so if the domain type returned by ops-
> > > def_domain_type uses a flush queue (i.e. the __IOMMU_DOMAIN_DMA_FQ bit
> > is set). That way a device that only supports identity mapping gets to
> > set that but iommu_dma_strict at least always prevents use of an IOVA
> > flush queue.
> 
> I would prefer we create some formal caps in iommu_ops to describe
> whatever it is you are trying to do.
> 
> Jason

I agree that there is currently a lack of distinction between what
domain types can be used (capability) and which should be used as
default (iommu.strict=<x>, iommu_set_...(), CONFIG_IOMMU_DEFAULT_DMA,
ops->def_domain_type.).

My case though is about the latter which I think has some undocumented
and surprising precedences built in at the moment. With this series we
can use all of IOMMU_DOMAIN_DMA(_FQ, _SQ) on any PCI device but we want
to default to either IOMMU_DOMAIN_DMA_FQ or IOMMU_DOMAIN_DMA_SQ based on
whether we're running in a paging hypervisor (z/VM or KVM) to get the
best performance. From a semantic point of view I felt that this is a
good match for ops->def_domain_type in that we pick a default but it's
still possible to change the domain type e.g. via sysfs. Now this had
the problem that ops->def_domain_type would cause IOMMU_DOMAIN_DMA_FQ
to be used even if iommu_set_dma_strict() was called (via
iommu.strict=1) because there is an undocumented override of ops-
>def_domain_type over iommu_def_domain_type which I believe comes from
the mixing of capability and default you also point at.

I think ideally we need two separate mechanisms to determine which
domain types can be used for a particular device (capability) and for
which one to default to with the latter part having a clear precedence
between the options. Put together I think iommu.strict=1 should
override a device's preference (ops->def_domain_type) and
CONFIG_IOMMU_DEFAULT_DMA to use IOMMU_DOMAIN_DMA but of course only if
the device is capable of that. Does that sound reasonable?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 15:54         ` Niklas Schnelle
@ 2022-11-28 16:35           ` Jason Gunthorpe
  2022-11-28 21:01             ` Robin Murphy
  2022-11-28 16:56           ` Robin Murphy
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-11-28 16:35 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Robin Murphy, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Mon, Nov 28, 2022 at 04:54:03PM +0100, Niklas Schnelle wrote:

> I agree that there is currently a lack of distinction between what
> domain types can be used (capability) and which should be used as
> default (iommu.strict=<x>, iommu_set_...(), CONFIG_IOMMU_DEFAULT_DMA,
> ops->def_domain_type.).

What I would like to get to is the drivers only expose UNMANAGED,
IDENTITY and BLOCKING domains. Everything that the DMA/FQ/etc domains
were doing should be handled as some kind of cap.

Eg, after Lu's series:

https://lore.kernel.org/linux-iommu/20221128064648.1934720-1-baolu.lu@linux.intel.com/

We should be able to remove IOMMU_DOMAIN_DMA and its related from the
drivers entirely. Instead drivers will provide UNMANAGED domains and
dma-iommu.c will operate the UNMANAGED domain to provide the dma
api. We can detect if the driver supports this by set_platform_dma()
being NULL.

A statement that a driver performs better using SQ/FQ/none should be
something that is queried from the UNMANAGED domain as guidance to
dma-iommu.c on what configuration to pick if not overridden.

So, I would say what you want is some option flag, perhaps on the
domain ops: 'domain performs better with SQ or FQ'
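
As a purely hypothetical sketch (the hints field and flag below are made
up for illustration, only the FQ size constants exist in this series):

	/* hint set by the driver on its domain ops */
	#define IOMMU_DOMAIN_HINT_SINGLE_FQ	BIT(0)

	/* dma-iommu.c picking the lazy-flush flavour, unless overridden */
	if (domain->ops->hints & IOMMU_DOMAIN_HINT_SINGLE_FQ)
		fq_size = IOVA_SINGLE_FQ_SIZE;
	else
		fq_size = IOVA_DEFAULT_FQ_SIZE;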

> My case though is about the latter which I think has some undocumented
> and surprising precedences built in at the moment. With this series we
> can use all of IOMMU_DOMAIN_DMA(_FQ, _SQ) on any PCI device but we want
> to default to either IOMMU_DOMAIN_DMA_FQ or IOMMU_DOMAIN_SQ based on
> whether we're running in a paging hypervisor (z/VM or KVM) to get the
> best performance. From a semantic point of view I felt that this is a
> good match for ops->def_domain_type in that we pick a default but it's
> still possible to change the domain type e.g. via sysfs. Now this had
> the problem that ops->def_domain_type would cause IOMMU_DOMAIN_DMA_FQ
> to be used even if iommu_set_dma_strict() was called (via
> iommu.strict=1) because there is a undocumented override of ops-
> >def_domain_type over iommu_def_domain_type which I believe comes from
> the mixing of capability and default you also point at.

Yeah, this does sounds troubled.

> I think ideally we need two separate mechanism to determine which
> domain types can be used for a particular device (capability) and for
> which one to default to with the latter part having a clear precedence
> between the options. Put together I think iommu.strict=1 should
> override a device's preference (ops->def_domain_type) and
> CONFIG_IOMMU_DEFAULT_DMA to use IOMMU_DOMAIN_DMA but of course only if
> the device is capable of that. Does that sound reasonable?

Using the language above, if someone asks for strict then the
infrastructure should try to allocate an UNMANAGED domain, not an
identity domain, and not use the lazy flush algorithms in dma-iommu.c

Similarly if sysfs asks for lazy flush or identity maps then it should
get it, always.

The driver should have no say in how dma-iommu.c works beyond if it
provides the required ops functionalities, and hint(s) as to what
gives best performance.

So anything that moves closer to this direction seems like a good
choice to me.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 15:54         ` Niklas Schnelle
  2022-11-28 16:35           ` Jason Gunthorpe
@ 2022-11-28 16:56           ` Robin Murphy
  1 sibling, 0 replies; 39+ messages in thread
From: Robin Murphy @ 2022-11-28 16:56 UTC (permalink / raw)
  To: Niklas Schnelle, Jason Gunthorpe
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On 2022-11-28 15:54, Niklas Schnelle wrote:
> On Mon, 2022-11-28 at 09:29 -0400, Jason Gunthorpe wrote:
>> On Mon, Nov 28, 2022 at 12:10:39PM +0100, Niklas Schnelle wrote:
>>> On Thu, 2022-11-17 at 09:55 +0800, Baolu Lu wrote:
>>>> On 2022/11/17 1:16, Niklas Schnelle wrote:
>>>>> When iommu.strict=1 is set or iommu_set_dma_strict() was called we
>>>>> should use IOMMU_DOMAIN_DMA irrespective of ops->def_domain_type.
>>>>>
>>>>> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
>>>>> ---
>>>>>    drivers/iommu/iommu.c | 3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>>> index 65a3b3d886dc..d9bf94d198df 100644
>>>>> --- a/drivers/iommu/iommu.c
>>>>> +++ b/drivers/iommu/iommu.c
>>>>> @@ -1562,6 +1562,9 @@ static int iommu_get_def_domain_type(struct device *dev)
>>>>>    {
>>>>>    	const struct iommu_ops *ops = dev_iommu_ops(dev);
>>>>>    
>>>>> +	if (iommu_dma_strict)
>>>>> +		return IOMMU_DOMAIN_DMA;
>>>>
>>>> If any quirky device must work in IOMMU identity mapping mode, this
>>>> might introduce functional regression. At least for VT-d platforms, some
>>>> devices do require IOMMU identity mapping mode for functionality.
>>>
>>> That's a good point. How about instead of unconditionally returning
>>> IOMMU_DOMAIN_DMA we just do so if the domain type returned by ops-
>>>> def_domain_type uses a flush queue (i.e. the __IOMMU_DOMAIN_DMA_FQ bit
>>> is set). That way a device that only supports identity mapping gets to
>>> set that but iommu_dma_strict at least always prevents use of an IOVA
>>> flush queue.
>>
>> I would prefer we create some formal caps in iommu_ops to describe
>> whatever it is you are trying to do.
>>
>> Jason
> 
> I agree that there is currently a lack of distinction between what
> domain types can be used (capability) and which should be used as
> default (iommu.strict=<x>, iommu_set_...(), CONFIG_IOMMU_DEFAULT_DMA,
> ops->def_domain_type.).

As far as I'm concerned, the purpose of .def_domain_type is really just 
for quirks where the device needs an identity mapping, based on 
knowledge that tends to be sufficiently platform-specific that we prefer 
to delegate it to drivers. What apple-dart is doing is really just a 
workaround for not being able to indicate per-instance domain type support at 
the point of the .domain_alloc call, and IIRC what mtk_iommu_v1 is doing 
is a horrible giant hack around the arch/arm DMA ops that don't 
understand IOMMU groups. Both of those situations are on the cards to be 
cleaned up, so don't take too much from them.

> My case though is about the latter which I think has some undocumented
> and surprising precedences built in at the moment. With this series we
> can use all of IOMMU_DOMAIN_DMA(_FQ, _SQ) on any PCI device but we want
> to default to either IOMMU_DOMAIN_DMA_FQ or IOMMU_DOMAIN_SQ based on
> whether we're running in a paging hypervisor (z/VM or KVM) to get the
> best performance. From a semantic point of view I felt that this is a
> good match for ops->def_domain_type in that we pick a default but it's
> still possible to change the domain type e.g. via sysfs. Now this had
> the problem that ops->def_domain_type would cause IOMMU_DOMAIN_DMA_FQ
> to be used even if iommu_set_dma_strict() was called (via
> iommu.strict=1) because there is a undocumented override of ops-
>> def_domain_type over iommu_def_domain_type which I believe comes from
> the mixing of capability and default you also point at.
> 
> I think ideally we need two separate mechanism to determine which
> domain types can be used for a particular device (capability) and for
> which one to default to with the latter part having a clear precedence
> between the options. Put together I think iommu.strict=1 should
> override a device's preference (ops->def_domain_type) and
> CONFIG_IOMMU_DEFAULT_DMA to use IOMMU_DOMAIN_DMA but of course only if
> the device is capable of that. Does that sound reasonable?

That sounds like essentially what we already have, though. The current 
logic should be thus:

1: If the device is untrusted, it gets strict translation, nothing else. 
If that won't actually work, tough.
2: If .def_domain_type returns a specific type, it is because any other 
type will not work correctly at all, so we must use that.
3: Otherwise, we compute the user's preferred default type based on 
kernel config and command line options.
4: Then we determine whether the IOMMU driver actually supports that 
type, by trying to allocate it. If allocation fails and the preferred 
type was more relaxed than IOMMU_DOMAIN_DMA, fall back to the stricter 
type and try one last time.

AFAICS the distinction and priority of those steps is pretty clear:

1: Core requirements
2: Driver-specific requirements
3: Core preference
4: Driver-specific support

Now, for step 4 we *could* potentially use static capability flags in 
place of the "try allocating different things until one succeeds", but 
that doesn't change anything other than saving the repetitive 
boilerplate in everyone's .domain_alloc implementations. The real moral 
of the story here is not to express a soft preference where it will be 
interpreted as a hard requirement.
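
Or, spelled out as very rough pseudocode (glossing over the sysfs path
and error handling):

	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
		type = IOMMU_DOMAIN_DMA;		/* 1: core requirement */
	else if (ops->def_domain_type && ops->def_domain_type(dev))
		type = ops->def_domain_type(dev);	/* 2: driver requirement */
	else
		type = iommu_def_domain_type;		/* 3: core preference */

	dom = __iommu_domain_alloc(bus, type);		/* 4: driver support... */
	if (!dom && type != IOMMU_DOMAIN_DMA)
		dom = __iommu_domain_alloc(bus, IOMMU_DOMAIN_DMA);	/* ...fall back */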

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/7] s390/pci: Use dma-iommu layer
  2022-11-16 17:16 ` [PATCH v2 3/7] s390/pci: Use dma-iommu layer Niklas Schnelle
@ 2022-11-28 18:03   ` Robin Murphy
  2022-12-19 15:17     ` Niklas Schnelle
  0 siblings, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-11-28 18:03 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On 2022-11-16 17:16, Niklas Schnelle wrote:
[...]
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index dc5f7a156ff5..804fb8f42d61 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -93,7 +93,7 @@ config IOMMU_DEBUGFS
>   choice
>   	prompt "IOMMU default domain type"
>   	depends on IOMMU_API
> -	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64
> +	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64 || S390
>   	default IOMMU_DEFAULT_DMA_STRICT
>   	help
>   	  Choose the type of IOMMU domain used to manage DMA API usage by
> @@ -412,6 +412,7 @@ config ARM_SMMU_V3_SVA
>   config S390_IOMMU
>   	def_bool y if S390 && PCI
>   	depends on S390 && PCI
> +	select IOMMU_DMA

Please add S390 to the condition under config IOMMU_DMA instead.

>   	select IOMMU_API
>   	help
>   	  Support for the IOMMU API for s390 PCI devices.
[...]
> @@ -45,9 +329,14 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
>   {
>   	struct s390_domain *s390_domain;
>   
> -	if (domain_type != IOMMU_DOMAIN_UNMANAGED)
> +	switch (domain_type) {
> +	case IOMMU_DOMAIN_DMA:
> +	case IOMMU_DOMAIN_DMA_FQ:

I was about to question adding this without any visible treatment of 
iommu_iotlb_gather_queued() or domain->type == IOMMU_DOMAIN_DMA_FQ as a 
whole, but I guess if you never actually free any pagetables it does 
work out OK. Bit of a timebomb if there's a chance of that ever changing 
in future, though.

> +	case IOMMU_DOMAIN_UNMANAGED:
> +		break;
> +	default:
>   		return NULL;
> -
> +	}
>   	s390_domain = kzalloc(sizeof(*s390_domain), GFP_KERNEL);
>   	if (!s390_domain)
>   		return NULL;
[...]
> @@ -397,12 +678,29 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
>   		return 0;
>   
>   	iommu_iotlb_gather_add_range(gather, iova, size);
> +	atomic64_add(pgcount, &s390_domain->ctrs.unmapped_pages);
>   
>   	return size;
>   }
>   
> +static void s390_iommu_probe_finalize(struct device *dev)
> +{
> +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +
> +	iommu_dma_forcedac = true;
> +	iommu_setup_dma_ops(dev, domain->geometry.aperture_start, domain->geometry.aperture_end);

For consistency with all but one other caller now, just pass 0 and 
U64_MAX for base and size to make it clear that they're meaningless 
(they will eventually go away as part of a bigger refactoring).

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 16:35           ` Jason Gunthorpe
@ 2022-11-28 21:01             ` Robin Murphy
  2022-11-29 17:33               ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-11-28 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Niklas Schnelle
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On 2022-11-28 16:35, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2022 at 04:54:03PM +0100, Niklas Schnelle wrote:
> 
>> I agree that there is currently a lack of distinction between what
>> domain types can be used (capability) and which should be used as
>> default (iommu.strict=<x>, iommu_set_...(), CONFIG_IOMMU_DEFAULT_DMA,
>> ops->def_domain_type.).
> 
> What I would like to get to is the drivers only expose UNMANAGED,
> IDENTITY and BLOCKING domains. Everything that the DMA/FQ/etc domains
> were doing should be handled as some kind of cap.
> 
> Eg, after Lu's series:
> 
> https://lore.kernel.org/linux-iommu/20221128064648.1934720-1-baolu.lu@linux.intel.com/
> 
> We should be able to remove IOMMU_DOMAIN_DMA and its related from the
> drivers entirely. Instead drivers will provide UNMANAGED domains and
> dma-iommu.c will operate the UNMANAGED domain to provide the dma
> api. We can detect if the driver supports this by set_platform_dma()
> being NULL.
> 
> A statement that a driver performs better using SQ/FQ/none should be
> something that is queried from the UNMANAGED domain as a guidance to
> dma-iommu.c what configuration to pick if not overriden.

Ack, I'm sure it could be cleaner overall if the driver capabilities 
didn't come in right at the end of the process with the .domain_alloc 
dance. As I've said before, I would still like to keep the domain types 
in the core code (since they already work as a set of capability flags), 
but drivers not having to deal with them directly would be good. Maybe 
we dedicate .domain_alloc to paging domains, and have separate device 
ops for .get_{blocking,identity}_domain, given that optimised 
implementations of those are likely to be static or at least per-instance.

> So, I would say what you want is some option flag, perhaps on the
> domain ops: 'domain performs better with SQ or FQ'

Although for something that's likely to be global based on whether 
running virtualised or not, I'd be inclined to try pulling that as far 
as reasonably possible towards core code.

>> My case though is about the latter which I think has some undocumented
>> and surprising precedences built in at the moment. With this series we
>> can use all of IOMMU_DOMAIN_DMA(_FQ, _SQ) on any PCI device but we want
>> to default to either IOMMU_DOMAIN_DMA_FQ or IOMMU_DOMAIN_SQ based on
>> whether we're running in a paging hypervisor (z/VM or KVM) to get the
>> best performance. From a semantic point of view I felt that this is a
>> good match for ops->def_domain_type in that we pick a default but it's
>> still possible to change the domain type e.g. via sysfs. Now this had
>> the problem that ops->def_domain_type would cause IOMMU_DOMAIN_DMA_FQ
>> to be used even if iommu_set_dma_strict() was called (via
>> iommu.strict=1) because there is a undocumented override of ops-
>>> def_domain_type over iommu_def_domain_type which I believe comes from
>> the mixing of capability and default you also point at.
> 
> Yeah, this does sounds troubled.

The initial assumption about .def_domain_type is incorrect, though. From 
there it's a straightforward path to the conclusion that introducing 
inconsistency (by using the wrong mechanism) leads to the presence of 
inconsistency.

>> I think ideally we need two separate mechanism to determine which
>> domain types can be used for a particular device (capability) and for
>> which one to default to with the latter part having a clear precedence
>> between the options. Put together I think iommu.strict=1 should
>> override a device's preference (ops->def_domain_type) and
>> CONFIG_IOMMU_DEFAULT_DMA to use IOMMU_DOMAIN_DMA but of course only if
>> the device is capable of that. Does that sound reasonable?
> 
> Using the language above, if someone asks for strict then the
> infrastructure should try to allocate an UNMANAGED domain, not an
> identity domain,

Careful, "iommu.strict" refers specifically to the invalidation policy 
for DMA API domains, and I've tried to be careful to document it as 
such. It has never been defined to have any impact on anything other 
than DMA API domains, so I don't think any should be assumed. Control of 
the basic domain type (identity vs. translation) on the command line has 
always been via separate parameters, which I think have always had 
higher priority anyway. With sysfs you can ask for anything, but you'll 
still only get it if it's safe and guaranteed to work.

> and not use the lazy flush algorithms in dma-iommu.c
> 
> Similarly if sysfs asks for lazy flush or identity maps then it should
> get it, always.

I'm hardly an advocate for trying to save users from themselves, but I 
honestly can't see any justifiable reason for not having sysfs respect 
iommu_get_def_domain_type(). If a privileged user wants to screw up the 
system they're hardly short of options already. Far worse, though, is 
that someone nefarious would only need to popularise a "make external 
dGPUs and/or other popular accelerators faster on laptops" udev rule 
that forces identity domains via sysfs, and bye bye Thunderclap mitigations.

> The driver should have no say in how dma-iommu.c works beyond if it
> provides the required ops functionalities, and hint(s) as to what
> gives best performance.

That should already be the case today, as outlined in my other mail. 
It's just somewhat more evolved than designed, so may not be so clear to 
everyone.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-28 14:52   ` Robin Murphy
@ 2022-11-29 12:00     ` Niklas Schnelle
  2022-11-29 12:53       ` Robin Murphy
  2022-11-29 13:51       ` Matthew Rosato
  0 siblings, 2 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-29 12:00 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> On 2022-11-16 17:16, Niklas Schnelle wrote:
> > When RPCIT indicates that the underlying hypervisor has run out of
> > resources it often means that its IOVA space is exhausted and IOVAs need
> > to be freed before new ones can be created. By triggering a flush of the
> > IOVA queue we can get the queued IOVAs freed and also get the new
> > mapping established during the global flush.
> 
> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is 
> exhausted and fail the DMA API call before even getting as far as 
> iommu_map(), though? Or is there some less obvious limitation like a 
> maximum total number of distinct IOVA regions regardless of size?

Well, yes and no. Your thinking is of course correct: if the advertised
available IOVA space can be fully utilized without exhausting
hypervisor resources, we won't trigger this case. However, sadly there
are complications. The most obvious is that in QEMU/KVM the
restriction of the IOVA space to what QEMU can actually have mapped at
once was only added recently[0]; prior to that we would regularly go
through this "I'm out of resources, free me some IOVAs" dance with our
existing DMA API implementation, where this just triggers an early
cycle of freeing all unused IOVAs followed by a global flush. On z/VM
I know of no situations where this is triggered. That said, this
signalling is architected, so z/VM may have corner cases where it does
this. On our bare metal hypervisor (no paging) this return code is
unused and IOTLB flushes are simply hardware cache flushes, as on bare
metal platforms.

[0] 
https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/

> 
> Other than the firmware reserved region helpers which are necessarily a 
> bit pick-and-mix, I've been trying to remove all the iommu-dma details 
> from drivers, so I'd really like to maintain that separation if at all 
> possible.

Hmm, tough one. Having a flush queue implies that we're holding on to
IOVAs that we could free, and this is kind of directly architected
into our IOTLB flush with its "free some IOVAs and try again" error
return.

> 
> > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > ---
> >   drivers/iommu/dma-iommu.c  | 14 +++++++++-----
> >   drivers/iommu/dma-iommu.h  |  1 +
> >   drivers/iommu/s390-iommu.c |  7 +++++--
> >   3 files changed, 15 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index 3801cdf11aa8..54e7f63fd0d9 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie)
> >   	spin_unlock_irqrestore(&fq->lock, flags);
> >   }
> >   
> > -static void fq_flush_timeout(struct timer_list *t)
> > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie)
> >   {
> > -	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> > -
> > -	atomic_set(&cookie->fq_timer_on, 0);
> >   	fq_flush_iotlb(cookie);
> > -
> >   	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
> >   		fq_flush_percpu(cookie);
> >   	else
> >   		fq_flush_single(cookie);
> >   }
> >   
> > +static void fq_flush_timeout(struct timer_list *t)
> > +{
> > +	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> > +
> > +	atomic_set(&cookie->fq_timer_on, 0);
> > +	iommu_dma_flush_fq(cookie);
> > +}
> > +
> >   static void queue_iova(struct iommu_dma_cookie *cookie,
> >   		unsigned long pfn, unsigned long pages,
> >   		struct list_head *freelist)
> > diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
> > index 942790009292..cac06030aa26 100644
> > --- a/drivers/iommu/dma-iommu.h
> > +++ b/drivers/iommu/dma-iommu.h
> > @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
> >   void iommu_put_dma_cookie(struct iommu_domain *domain);
> >   
> >   int iommu_dma_init_fq(struct iommu_domain *domain);
> > +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie);
> >   
> >   void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
> >   
> > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> > index 087bb2acff30..9c2782c4043e 100644
> > --- a/drivers/iommu/s390-iommu.c
> > +++ b/drivers/iommu/s390-iommu.c
> > @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
> >   {
> >   	struct s390_domain *s390_domain = to_s390_domain(domain);
> >   	struct zpci_dev *zdev;
> > +	int rc;
> >   
> >   	rcu_read_lock();
> >   	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
> >   		if (!zdev->tlb_refresh)
> >   			continue;
> >   		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
> > -		zpci_refresh_trans((u64)zdev->fh << 32,
> > -				   iova, size);
> > +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
> > +					iova, size);
> > +		if (rc == -ENOMEM)
> > +			iommu_dma_flush_fq(domain->iova_cookie);
> 
> Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or 
> IOMMU_DOMAIN_UNMANAGED domain?

In theory yes, and then iommu_dma_flush_fq() still does the
.flush_iotlb_all to give the hypervisor a chance to look for freed
IOVAs, but without flush queues you're really just running out of IOVA
space and that's futile.

This does highlight an important missed issue though. As we don't
return the resulting error from the subsequent .flush_iotlb_all, we
only find out that this didn't work once the mapping is used, while in
our current DMA API implementation we correctly return
DMA_MAPPING_ERROR in this case. I guess this means we do need error
returns from the IOTLB helpers, since in a paged guest this is where
we finally find out that our mapping couldn't be synced to the
hypervisor's shadow table, and I don't really see a way around that.
Also, there are other error conditions implied in this shadowing, for
example the hypervisor could simply fail to pin guest memory; while we
can't do anything about that, at least we should fail the mapping
operation.
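
Roughly this kind of signature change, sketched against the current
ops (a sketch, not a tested patch):

	-	void (*iotlb_sync_map)(struct iommu_domain *domain,
	-			       unsigned long iova, size_t size);
	+	int (*iotlb_sync_map)(struct iommu_domain *domain,
	+			       unsigned long iova, size_t size);

plus letting the iommu_map() path propagate that error to its callers.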

> 
> However I can't figure out how this is supposed to work anyway - 
> .sync_map only gets called if .map claimed that the actual mapping(s) 
> succeeded, it can't fail itself, and even if it does free up some IOVAs 
> at this point by draining the flush queue, I don't see how the mapping 
> then gets retried, or what happens if it still fails after that :/
> 
> Thanks,
> Robin.

Yeah, this is a bit non-obvious and you are correct in that the
architecture requires a subsequent IOTLB flush, i.e. a retry, for the
range that returned the error. And your last point is then exactly the
issue above: we miss it if the retry still failed.

As for the good path, in the mapping operation but before the
.sync_map we have updated the IOMMU translation table, so the
translation is recorded but not yet synced to the hypervisor's shadow
table. Now when the .sync_map is called and the IOTLB flush returns
-ENOMEM, iommu_dma_flush_fq() will call .flush_iotlb_all, which causes
the hypervisor to look at the entire guest translation table and
shadow all translations that were not yet shadowed. I.e. the
.flush_iotlb_all "retries" the failed .sync_map.

> 
> >   	}
> >   	rcu_read_unlock();
> >   }



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-29 12:00     ` Niklas Schnelle
@ 2022-11-29 12:53       ` Robin Murphy
  2022-11-29 14:40         ` Niklas Schnelle
  2022-11-29 13:51       ` Matthew Rosato
  1 sibling, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-11-29 12:53 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On 2022-11-29 12:00, Niklas Schnelle wrote:
> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
>> On 2022-11-16 17:16, Niklas Schnelle wrote:
>>> When RPCIT indicates that the underlying hypervisor has run out of
>>> resources it often means that its IOVA space is exhausted and IOVAs need
>>> to be freed before new ones can be created. By triggering a flush of the
>>> IOVA queue we can get the queued IOVAs freed and also get the new
>>> mapping established during the global flush.
>>
>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
>> exhausted and fail the DMA API call before even getting as far as
>> iommu_map(), though? Or is there some less obvious limitation like a
>> maximum total number of distinct IOVA regions regardless of size?
> 
> Well, yes and no. Your thinking is of course correct if the advertised
> available IOVA space can be fully utilized without exhausting
> hypervisor resources we won't trigger this case. However sadly there
> are complications. The most obvious being that in QEMU/KVM the
> restriction of the IOVA space to what QEMU can actually have mapped at
> once was just added recently[0] prior to that we would regularly go
> through this "I'm out of resources free me some IOVAs" dance with our
> existing DMA API implementation where this just triggers an early cycle
> of freeing all unused IOVAs followed by a global flush. On z/VM I know
> of no situations where this is triggered. That said this signalling is
> architected so z/VM may have corner cases where it does this. On our
> bare metal hypervisor (no paging) this return code is unused and IOTLB
> flushes are simply hardware cache flushes as on bare metal platforms.
> 
> [0]
> https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/

That sheds a bit more light, thanks, although I'm still not confident I 
fully understand the whole setup. AFAICS that patch looks to me like 
it's putting a fixed limit on the size of the usable address space. That 
in turn implies that "free some IOVAs and try again" might be a red 
herring and never going to work; for your current implementation, what 
that presumably means in reality is "free some IOVAs, resetting the 
allocator to start allocating lower down in the address space where it 
will happen to be below that limit, and try again", but the iommu-dma 
allocator won't do that. If it doesn't know that some arbitrary range 
below the top of the driver-advertised aperture is unusable, it will 
just keep allocating IOVAs up there and mappings will always fail.

If the driver can't accurately represent the usable IOVA space via the 
aperture and/or reserved regions, then this whole approach seems doomed. 
If on the other hand I've misunderstood and you can actually still use 
any address, just not all of them at the same time, then it might in 
fact be considerably easier to skip the flush queue mechanism entirely 
and implement this internally to the driver - basically make .iotlb_sync 
a no-op for non-strict DMA domains, put the corresponding RPCIT flush 
and retry in .sync_map, then allow that to propagate an error back to 
iommu_map() if the new mapping still hasn't taken.

Thanks,
Robin.

>> Other than the firmware reserved region helpers which are necessarily a
>> bit pick-and-mix, I've been trying to remove all the iommu-dma details
>> from drivers, so I'd really like to maintain that separation if at all
>> possible.
> 
> Hmm, tough one. Having a flush queue implies that we're holding on to
> IOVAs that we could free and this is kind of directly architected into
> our IOTLB flush with this "free some IOVAs and try again" error return.
> 
>>
>>> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
>>> ---
>>>    drivers/iommu/dma-iommu.c  | 14 +++++++++-----
>>>    drivers/iommu/dma-iommu.h  |  1 +
>>>    drivers/iommu/s390-iommu.c |  7 +++++--
>>>    3 files changed, 15 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index 3801cdf11aa8..54e7f63fd0d9 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -188,19 +188,23 @@ static void fq_flush_single(struct iommu_dma_cookie *cookie)
>>>    	spin_unlock_irqrestore(&fq->lock, flags);
>>>    }
>>>    
>>> -static void fq_flush_timeout(struct timer_list *t)
>>> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie)
>>>    {
>>> -	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
>>> -
>>> -	atomic_set(&cookie->fq_timer_on, 0);
>>>    	fq_flush_iotlb(cookie);
>>> -
>>>    	if (cookie->fq_domain->type == IOMMU_DOMAIN_DMA_FQ)
>>>    		fq_flush_percpu(cookie);
>>>    	else
>>>    		fq_flush_single(cookie);
>>>    }
>>>    
>>> +static void fq_flush_timeout(struct timer_list *t)
>>> +{
>>> +	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
>>> +
>>> +	atomic_set(&cookie->fq_timer_on, 0);
>>> +	iommu_dma_flush_fq(cookie);
>>> +}
>>> +
>>>    static void queue_iova(struct iommu_dma_cookie *cookie,
>>>    		unsigned long pfn, unsigned long pages,
>>>    		struct list_head *freelist)
>>> diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
>>> index 942790009292..cac06030aa26 100644
>>> --- a/drivers/iommu/dma-iommu.h
>>> +++ b/drivers/iommu/dma-iommu.h
>>> @@ -13,6 +13,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain);
>>>    void iommu_put_dma_cookie(struct iommu_domain *domain);
>>>    
>>>    int iommu_dma_init_fq(struct iommu_domain *domain);
>>> +void iommu_dma_flush_fq(struct iommu_dma_cookie *cookie);
>>>    
>>>    void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
>>>    
>>> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
>>> index 087bb2acff30..9c2782c4043e 100644
>>> --- a/drivers/iommu/s390-iommu.c
>>> +++ b/drivers/iommu/s390-iommu.c
>>> @@ -538,14 +538,17 @@ static void s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
>>>    {
>>>    	struct s390_domain *s390_domain = to_s390_domain(domain);
>>>    	struct zpci_dev *zdev;
>>> +	int rc;
>>>    
>>>    	rcu_read_lock();
>>>    	list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
>>>    		if (!zdev->tlb_refresh)
>>>    			continue;
>>>    		atomic64_inc(&s390_domain->ctrs.sync_map_rpcits);
>>> -		zpci_refresh_trans((u64)zdev->fh << 32,
>>> -				   iova, size);
>>> +		rc = zpci_refresh_trans((u64)zdev->fh << 32,
>>> +					iova, size);
>>> +		if (rc == -ENOMEM)
>>> +			iommu_dma_flush_fq(domain->iova_cookie);
>>
>> Could -ENOMEM ever be returned for some reason on an IOMMU_DOMAIN_DMA or
>> IOMMU_DOMAIN_UNMANAGED domain?
> 
> In theory yes and then iommu_dma_flush_fq() still does the
> .flush_iotlb_all to give the hypervisor a chance to look for freed
> IOVAs but without flush queues you're really just running out of IOVA
> space and that's futile.
> 
> This does highlight an important missed issue though. As we don't
> return the resulting error from the subsequent .flush_iotlb_all we only
> find out if this didn't work once the mapping is used while in our
> current DMA API implementation we correctly return DMA_MAPPING_ERROR in
> this case. I guess this means we do need error returns from the IOTLB
> helpers since in a paged guest this is where we finally find out that
> our mapping couldn't be synced to the hypervisor's  shadow table and I
> don't really see a way around that. Also there are other error
> conditions implied in this shadowing too, for example the hypervisor
> could simply fail to pin guest memory and while we can't do anything
> about that at least we should fail the mapping operation.
> 
>>
>> However I can't figure out how this is supposed to work anyway -
>> .sync_map only gets called if .map claimed that the actual mapping(s)
>> succeeded, it can't fail itself, and even if it does free up some IOVAs
>> at this point by draining the flush queue, I don't see how the mapping
>> then gets retried, or what happens if it still fails after that :/
>>
>> Thanks,
>> Robin.
> 
> Yeah, this is a bit non obvious and you are correct in that the
> architecture requires a subseqeunt IOTLB flush i.e. retry for the range
> that returned the error. And your last point is then exactly the issue
> above that we miss if the retry still failed.
> 
> As for the good path, in the mapping operation but before the .sync_map
> we have updated the IOMMU translation table so the translation is the
> recorded but not synced to the hypervisor's shadow table. Now when the
> .sync_map is called and the IOTLB flush returns -ENOMEM then
> iommu_dma_flush_fq() will call .flush_iotlb_all which causes the
> hypervisor to look at the entire guest translation table and shadow all
> translations that were not yet shadowed. I.e. the .flush_iotlb_all
> "retries" the failed .sync_map.
> 
>>
>>>    	}
>>>    	rcu_read_unlock();
>>>    }
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-29 12:00     ` Niklas Schnelle
  2022-11-29 12:53       ` Robin Murphy
@ 2022-11-29 13:51       ` Matthew Rosato
  1 sibling, 0 replies; 39+ messages in thread
From: Matthew Rosato @ 2022-11-29 13:51 UTC (permalink / raw)
  To: Niklas Schnelle, Robin Murphy, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On 11/29/22 7:00 AM, Niklas Schnelle wrote:
> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
>> On 2022-11-16 17:16, Niklas Schnelle wrote:
>>> When RPCIT indicates that the underlying hypervisor has run out of
>>> resources it often means that its IOVA space is exhausted and IOVAs need
>>> to be freed before new ones can be created. By triggering a flush of the
>>> IOVA queue we can get the queued IOVAs freed and also get the new
>>> mapping established during the global flush.
>>
>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is 
>> exhausted and fail the DMA API call before even getting as far as 
>> iommu_map(), though? Or is there some less obvious limitation like a 
>> maximum total number of distinct IOVA regions regardless of size?
> 
> Well, yes and no. Your thinking is of course correct if the advertised
> available IOVA space can be fully utilized without exhausting
> hypervisor resources we won't trigger this case. However sadly there
> are complications. The most obvious being that in QEMU/KVM the
> restriction of the IOVA space to what QEMU can actually have mapped at
> once was just added recently[0] prior to that we would regularly go
> through this "I'm out of resources free me some IOVAs" dance with our
> existing DMA API implementation where this just triggers an early cycle
> of freeing all unused IOVAs followed by a global flush. On z/VM I know
> of no situations where this is triggered. That said this signalling is

While the QEMU case made for an easily-reproducible scenario, the
indication was really provided to handle a scenario where you have
multiple pageable guests whose sum of total memory is overcommitted as
compared to the hypervisor's resources.  The intent is for the entire
advertised guest aperture to be usable and generally-speaking it is,
but it's possible to find yourself in a (poorly tuned) scenario where
the hypervisor is unable to pin additional pages (basically an OOM
condition) -- the hypervisor (qemu/kvm or z/VM) can use this as a cry
for help to say "stop what you're doing and flush your queues
immediately so I can unpin as much as possible", and then after that
the guest(s) can continue using their aperture.

This is unnecessary for the no-paging bare metal hypervisor because
there you aren't over-committing memory.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-29 12:53       ` Robin Murphy
@ 2022-11-29 14:40         ` Niklas Schnelle
  2022-12-02 14:29           ` Niklas Schnelle
  0 siblings, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-11-29 14:40 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> On 2022-11-29 12:00, Niklas Schnelle wrote:
> > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > to be freed before new ones can be created. By triggering a flush of the
> > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > mapping established during the global flush.
> > > 
> > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > exhausted and fail the DMA API call before even getting as far as
> > > iommu_map(), though? Or is there some less obvious limitation like a
> > > maximum total number of distinct IOVA regions regardless of size?
> > 
> > Well, yes and no. Your thinking is of course correct if the advertised
> > available IOVA space can be fully utilized without exhausting
> > hypervisor resources we won't trigger this case. However sadly there
> > are complications. The most obvious being that in QEMU/KVM the
> > restriction of the IOVA space to what QEMU can actually have mapped at
> > once was just added recently[0] prior to that we would regularly go
> > through this "I'm out of resources free me some IOVAs" dance with our
> > existing DMA API implementation where this just triggers an early cycle
> > of freeing all unused IOVAs followed by a global flush. On z/VM I know
> > of no situations where this is triggered. That said this signalling is
> > architected so z/VM may have corner cases where it does this. On our
> > bare metal hypervisor (no paging) this return code is unused and IOTLB
> > flushes are simply hardware cache flushes as on bare metal platforms.
> > 
> > [0]
> > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/
> 
> That sheds a bit more light, thanks, although I'm still not confident I 
> fully understand the whole setup. AFAICS that patch looks to me like 
> it's putting a fixed limit on the size of the usable address space. That 
> in turn implies that "free some IOVAs and try again" might be a red 
> herring and never going to work; for your current implementation, what 
> that presumably means in reality is "free some IOVAs, resetting the 
> allocator to start allocating lower down in the address space where it 
> will happen to be below that limit, and try again", but the iommu-dma 
> allocator won't do that. If it doesn't know that some arbitrary range 
> below the top of the driver-advertised aperture is unusable, it will 
> just keep allocating IOVAs up there and mappings will always fail.
> 
> If the driver can't accurately represent the usable IOVA space via the 
> aperture and/or reserved regions, then this whole approach seems doomed. 
> If on the other hand I've misunderstood and you can actually still use 
> any address, just not all of them at the same time,


This is exactly it: the problem is a limit on the number of IOVAs that
are concurrently mapped. In QEMU pass-through the tightest limit is
usually the one set by the host kernel parameter
vfio_iommu_type1.dma_entry_limit, which defaults to 65535 mappings.
With IOMMU_DOMAIN_DMA we stay under this limit without extra action,
but once there is a flush queue (including the existing per-CPU one),
where each entry may keep many pages lazily unmapped, this is easily
hit with fio bandwidth tests on an NVMe. For this case this patch
works reliably because the number of actually active mappings,
excluding the lazily freed ones, is of course similar to the number of
active mappings with IOMMU_DOMAIN_DMA.

>  then it might in 
> fact be considerably easier to skip the flush queue mechanism entirely 
> and implement this internally to the driver - basically make .iotlb_sync 
> a no-op for non-strict DMA domains,

I'm assuming you mean .iotlb_sync_map above.

>  put the corresponding RPCIT flush 
> and retry in .sync_map, then allow that to propagate an error back to 
> iommu_map() if the new mapping still hasn't taken.
> 
> Thanks,
> Robin.

Hmm, interesting. This would leave the IOVAs in the flush queue lazily
unmapped and thus still block their re-use, but free their host
resources via a global RPCIT, allowing the guest to use a different
portion of the IOVA space with those resources. It could work, though
I need to test it, but it feels a bit clunky.

Maybe we can go cleaner while using this idea of not having to flush
the queue but just freeing the host-side resources. If we allowed
.iotlb_sync_map to return an error that fails the mapping operation,
then we could do it all in there. In the normal case it just does the
RPCIT, but if that indicates that the hypervisor ran out of resources
it does another, global RPCIT allowing the hypervisor to free IOVAs
that were lazily unmapped. If the latter succeeds all is good; if not,
the mapping operation fails. Logically it makes sense too:
.iotlb_sync_map is the final step of syncing the mapping to the host,
which can fail just like the mapping operation itself.

Apart from the out-of-active-IOVAs case this would also handle the
other useful error case when using .iotlb_sync_map for shadowing,
where it fails because the host ran against a pinned-pages limit or
out of actual memory. Not by fixing it, but at least we would get a
failed mapping operation.

The other callbacks, .flush_iotlb_all and .iotlb_sync, could stay the
same as they are only used for unmapped pages, where we can't
reasonably run out of resources in the host, neither active IOVAs nor
pinned pages.
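
A minimal sketch of that variant, assuming .iotlb_sync_map were
allowed to return an error (s390_iommu_rpcit_all() stands in for a
hypothetical helper that refreshes the device's whole aperture):

	static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
					     unsigned long iova, size_t size)
	{
		struct s390_domain *s390_domain = to_s390_domain(domain);
		struct zpci_dev *zdev;
		int rc = 0;

		rcu_read_lock();
		list_for_each_entry_rcu(zdev, &s390_domain->devices, iommu_list) {
			if (!zdev->tlb_refresh)
				continue;
			rc = zpci_refresh_trans((u64)zdev->fh << 32, iova, size);
			if (rc == -ENOMEM) {
				/* Hypervisor out of resources: a global RPCIT
				 * lets it drop shadow entries for lazily
				 * unmapped IOVAs and re-shadow the range we
				 * just mapped. */
				rc = s390_iommu_rpcit_all(zdev);
			}
			if (rc)
				break;
		}
		rcu_read_unlock();
		/* A non-zero rc would fail the whole mapping operation. */
		return rc;
	}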



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-28 21:01             ` Robin Murphy
@ 2022-11-29 17:33               ` Jason Gunthorpe
  2022-11-29 18:41                 ` Robin Murphy
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-11-29 17:33 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Niklas Schnelle, Baolu Lu, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:

> I'm hardly an advocate for trying to save users from themselves, but I
> honestly can't see any justifiable reason for not having sysfs respect
> iommu_get_def_domain_type(). 

We really need to rename this value if it is not actually just an
advisory "default" but a functional requirement ..

> > The driver should have no say in how dma-iommu.c works beyond if it
> > provides the required ops functionalities, and hint(s) as to what
> > gives best performance.
> 
> That should already be the case today, as outlined in my other mail. It's
> just somewhat more evolved than designed, so may not be so clear to
> everyone.

It is close to being clear, once we get the last touches of dma-iommu
stuff out of the drivers it should be quite clear

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-29 17:33               ` Jason Gunthorpe
@ 2022-11-29 18:41                 ` Robin Murphy
  2022-11-29 20:09                   ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-11-29 18:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Niklas Schnelle, Baolu Lu, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On 2022-11-29 17:33, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:
> 
>> I'm hardly an advocate for trying to save users from themselves, but I
>> honestly can't see any justifiable reason for not having sysfs respect
>> iommu_get_def_domain_type().
> 
> We really need to rename this value if it is not actually just an
> advisory "default" but a functional requirement ..

It represents a required default domain type. As in, the type for the 
device's default domain. Not the default type for a domain. It's the 
iommu_def_domain_type variable that holds the *default* default domain 
type ;)

Which reminds me I should finish that patch undoing my terrible 
ops->default_domain_ops idea, not least because they are misleadingly 
unrelated to default domains...

>>> The driver should have no say in how dma-iommu.c works beyond if it
>>> provides the required ops functionalities, and hint(s) as to what
>>> gives best performance.
>>
>> That should already be the case today, as outlined in my other mail. It's
>> just somewhat more evolved than designed, so may not be so clear to
>> everyone.
> 
> It is close to being clear, once we get the last touches of dma-iommu
> stuff out of the drivers it should be quite clear

Cool, some upheaval of .domain_alloc is next on my hitlist anyway, so 
that might be a good excuse to upheave it a bit more and streamline the 
type stuff along the way.

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-29 18:41                 ` Robin Murphy
@ 2022-11-29 20:09                   ` Jason Gunthorpe
  2022-11-30  1:28                     ` Baolu Lu
  2022-12-05 15:34                     ` Niklas Schnelle
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2022-11-29 20:09 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Niklas Schnelle, Baolu Lu, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Tue, Nov 29, 2022 at 06:41:22PM +0000, Robin Murphy wrote:
> On 2022-11-29 17:33, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:
> > 
> > > I'm hardly an advocate for trying to save users from themselves, but I
> > > honestly can't see any justifiable reason for not having sysfs respect
> > > iommu_get_def_domain_type().
> > 
> > We really need to rename this value if it is not actually just an
> > advisory "default" but a functional requirement ..
> 
> It represents a required default domain type. As in, the type for the
> device's default domain. Not the default type for a domain. It's the
> iommu_def_domain_type variable that holds the *default* default domain type
> ;)

I find the name "default domain" incredibly confusing at this point in
time.

I would like to call that the "dma-api domain" - its primary purpose
is to be the domain that the DMA API uses to operate the IOMMU, there
is little "default" about it. This meshes better with our apis talking
about ownership and so forth.

So, if the op was called
  get_dma_api_domain_type()

It is pretty clear that it is the exact type of domain that should be
created to support the DMA API, which is what I think you have been
describing it is supposed to do?

And with Lu's series we have the set_platform_dma() (Lu perhaps you
should call this set_platform_dma_api() to re-enforce it is about the
DMA API, not some nebulous DMA thing)

Which is basically the other way to configure the DMA API for
operation.

And encapsulating more of the logic to setup and manage the DMA API's
domain into dma-iommu.c would also be helpful to understanding.

> Which reminds me I should finish that patch undoing my terrible
> ops->default_domain_ops idea, not least because they are misleadingly
> unrelated to default domains...

:)

> > It is close to being clear, once we get the last touches of dma-iommu
> > stuff out of the drivers it should be quite clear
> 
> Cool, some upheaval of .domain_alloc is next on my hitlist anyway, so that
> might be a good excuse to upheave it a bit more and streamline the type
> stuff along the way.

Yes, I think so. I want to tidy things a bit so adding this "user
space" domain concept is a little nicer

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-29 20:09                   ` Jason Gunthorpe
@ 2022-11-30  1:28                     ` Baolu Lu
  2022-12-05 15:34                     ` Niklas Schnelle
  1 sibling, 0 replies; 39+ messages in thread
From: Baolu Lu @ 2022-11-30  1:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy
  Cc: baolu.lu, Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On 2022/11/30 4:09, Jason Gunthorpe wrote:
> On Tue, Nov 29, 2022 at 06:41:22PM +0000, Robin Murphy wrote:
>> On 2022-11-29 17:33, Jason Gunthorpe wrote:
>>> On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:
>>>
>>>> I'm hardly an advocate for trying to save users from themselves, but I
>>>> honestly can't see any justifiable reason for not having sysfs respect
>>>> iommu_get_def_domain_type().
>>>
>>> We really need to rename this value if it is not actually just an
>>> advisory "default" but a functional requirement ..
>>
>> It represents a required default domain type. As in, the type for the
>> device's default domain. Not the default type for a domain. It's the
>> iommu_def_domain_type variable that holds the *default* default domain type
>> ;)
> 
> I find the name "default domain" incredibly confusing at this point in
> time.
> 
> I would like to call that the "dma-api domain" - its primary purpose
> is to be the domain that the DMA API uses to operate the IOMMU, there
> is little "default" about it. This meshes better with our apis talking
> about ownership and so forth.
> 
> So, if the op was called
>    get_dma_api_domain_type()
> 
> It is pretty clear that it is the exact type of domain that should be
> created to support the DMA API, which is what I think you have been
> describing it is supposed to do?
> 
> And with Lu's series we have the set_platform_dma() (Lu perhaps you
> should call this set_platform_dma_api() to re-enforce it is about the
> DMA API, not some nebulous DMA thing)

Sure thing. It's more specific.

> 
> Which is basically the other way to configure the DMA API for
> operation.
> 
> And encapsulating more of the logic to setup and manage the DMA API's
> domain into dma-iommu.c would also be helpful to understanding.
> 
>> Which reminds me I should finish that patch undoing my terrible
>> ops->default_domain_ops idea, not least because they are misleadingly
>> unrelated to default domains...
> 
> :)
> 
>>> It is close to being clear, once we get the last touches of dma-iommu
>>> stuff out of the drivers it should be quite clear
>>
>> Cool, some upheaval of .domain_alloc is next on my hitlist anyway, so that
>> might be a good excuse to upheave it a bit more and streamline the type
>> stuff along the way.
> 
> Yes, I think so. I want to tidy things a bit so adding this "user
> space" domain concept is a little nicer
> 
> Jason
> 

--
Best regards,
baolu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-11-29 14:40         ` Niklas Schnelle
@ 2022-12-02 14:29           ` Niklas Schnelle
  2022-12-02 14:42             ` Jason Gunthorpe
  2022-12-05 18:24             ` Robin Murphy
  0 siblings, 2 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-12-02 14:29 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
> On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> > On 2022-11-29 12:00, Niklas Schnelle wrote:
> > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > > to be freed before new ones can be created. By triggering a flush of the
> > > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > > mapping established during the global flush.
> > > > 
> > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > > exhausted and fail the DMA API call before even getting as far as
> > > > iommu_map(), though? Or is there some less obvious limitation like a
> > > > maximum total number of distinct IOVA regions regardless of size?
> > > 
> > > Well, yes and no. Your thinking is of course correct if the advertised
> > > available IOVA space can be fully utilized without exhausting
> > > hypervisor resources we won't trigger this case. However sadly there
> > > are complications. The most obvious being that in QEMU/KVM the
> > > restriction of the IOVA space to what QEMU can actually have mapped at
> > > once was just added recently[0] prior to that we would regularly go
> > > through this "I'm out of resources free me some IOVAs" dance with our
> > > existing DMA API implementation where this just triggers an early cycle
> > > of freeing all unused IOVAs followed by a global flush. On z/VM I know
> > > of no situations where this is triggered. That said this signalling is
> > > architected so z/VM may have corner cases where it does this. On our
> > > bare metal hypervisor (no paging) this return code is unused and IOTLB
> > > flushes are simply hardware cache flushes as on bare metal platforms.
> > > 
> > > [0]
> > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/
> > 
> > That sheds a bit more light, thanks, although I'm still not confident I 
> > fully understand the whole setup. AFAICS that patch looks to me like 
> > it's putting a fixed limit on the size of the usable address space. That 
> > in turn implies that "free some IOVAs and try again" might be a red 
> > herring and never going to work; for your current implementation, what 
> > that presumably means in reality is "free some IOVAs, resetting the 
> > allocator to start allocating lower down in the address space where it 
> > will happen to be below that limit, and try again", but the iommu-dma 
> > allocator won't do that. If it doesn't know that some arbitrary range 
> > below the top of the driver-advertised aperture is unusable, it will 
> > just keep allocating IOVAs up there and mappings will always fail.
> > 
> > If the driver can't accurately represent the usable IOVA space via the 
> > aperture and/or reserved regions, then this whole approach seems doomed. 
> > If on the other hand I've misunderstood and you can actually still use 
> > any address, just not all of them at the same time,
> 
> 
> This is exactly it, the problem is a limit on the number of IOVAs that
> are concurrently mapped. In QEMU pass-through the tightest limit is
> usually the one set by the host kernel parameter
> vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With
> IOMMU_DOMAIN_DMA we stay under this limit without extra action but once
> there is a flush queue (including the existing per-CPU one) where each
> entry may keep many pages lazily unmapped this is easly hit with fio
> bandwidth tests on an NVMe. For this case this patch works reliably
> because of course the number of actually active mappings without the
> lazily freed ones is similar to the number of active ones with
> IOMMU_DOMAIN_DMA.
> 
> >  then it might in 
> > fact be considerably easier to skip the flush queue mechanism entirely 
> > and implement this internally to the driver - basically make .iotlb_sync 
> > a no-op for non-strict DMA domains,
> 
> I'm assuming you mean .iotlb_sync_map above.
> 
> >  put the corresponding RPCIT flush 
> > and retry in .sync_map, then allow that to propagate an error back to 
> > iommu_map() if the new mapping still hasn't taken.
> > 
> > Thanks,
> > Robin.
> 
> Hmm, interesting. This would leave the IOVAs in the flush queue lazily
> unmapped and thus still block their re-use but free their host
> resources via a global RPCIT allowing the guest to use a different
> porition of the IOVA space with those resources. It could work, though
> I need to test it, but it feels a bit clunky.
> 
> Maybe we can go cleaner while using this idea of not having to flush
> the queue but just freeing their host side resources. If we allowed
> .iotlb_sync_map to return an error that fails the mapping operation,
> then we could do it all in there. In the normal case it just does the
> RPCIT but if that returns that the hypervisor ran out of resources it
> does another global RPCIT allowing the hypervisor to free IOVAs that
> were lazily unmapped. If the latter succeeds all is good if not then
> the mapping operation failed. Logically it makes sense too,
> .iotlb_sync_map is the final step of syncing the mapping to the host
> which can fail just like the mapping operation itself.
> 
> Apart from the out of active IOVAs case this would also handle the
> other useful error case when using .iotlb_sync_map for shadowing where
> it fails because the host ran against a pinned pages limit or out of
> actual memory. Not by fixing it but at least we would get a failed
> mapping operation.
> 
> The other callbacks .flush_iotlb_all and .iotlb_sync
> could stay the same as they are only used for unmapped pages where we
> can't reasonably run out of resources in the host neither active IOVAs
> nor pinned pages.
> 

Ok, I've done some testing with the above idea and it seems to work
great. I've verified that my version of QEMU (without Matt's IOVA
aperture restriction patch) produces the RPCIT out-of-resource
indications, and that the global flush in .iotlb_sync_map is then
triggered and allows QEMU to unpin pages and free IOVAs while the
guest still has them lazily unmapped (sitting in the flush queue) and
thus uses different IOVAs.

@Robin @Joerg, if you are open to changing .iotlb_sync_map such that
it can return an error and then fail the mapping operation, I think
this is a great approach. One advantage over the previous approach of
flushing the queue is that this should work for the pure IOMMU API
too.

If you don't want to change the signature of .iotlb_sync_map I think
we can go with Robin's idea and have .iotlb_sync_map as a no-op and do
the RPCIT sync as part of s390_iommu_map_pages(). This would hide what
really is our variant of .iotlb_sync_map in the mapping code though,
which I don't super like. Besides that, it would also cause more
RPCITs in __iommu_map_sg() as we could no longer use a single RPCIT
for the entire range.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-12-02 14:29           ` Niklas Schnelle
@ 2022-12-02 14:42             ` Jason Gunthorpe
  2022-12-02 15:12               ` Niklas Schnelle
  2022-12-05 18:24             ` Robin Murphy
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-12-02 14:42 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On Fri, Dec 02, 2022 at 03:29:50PM +0100, Niklas Schnelle wrote:

> @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> can return and error and then failing the mapping operation I think
> this is a great approach. One advantage over the previous approach of
> flushing the queue isthat this should work for the pure IOMMU API too.

I think it is pretty important that the "pure" IOMMU API works with
whatever your solution is; the iommu_domain implementation in a
driver should not be sensitive to whether dma-iommu.c or something
else is operating the domain.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-12-02 14:42             ` Jason Gunthorpe
@ 2022-12-02 15:12               ` Niklas Schnelle
  2022-12-02 15:24                 ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-12-02 15:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On Fri, 2022-12-02 at 10:42 -0400, Jason Gunthorpe wrote:
> On Fri, Dec 02, 2022 at 03:29:50PM +0100, Niklas Schnelle wrote:
> 
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> > can return and error and then failing the mapping operation I think
> > this is a great approach. One advantage over the previous approach of
> > flushing the queue isthat this should work for the pure IOMMU API too.
> 
> I think it is pretty important that the "pure" IOMMU API work with
> whatever your solution its, the iommu_domain implementation in a
> driver should not be sensitive to if the dma-iommu.c is operating the
> domain or something else.
> 
> Jason

Agree. At the moment, i.e. with current mainline and even with the
IOMMU improvements in Joerg's queue, the IOMMU API doesn't work when
the hypervisor returns out-of-resource from the IOTLB flush (RPCIT).
This is currently only handled correctly in the arch/s390/pci/pci_dma.c
DMA API implementation. I think either handling this with a
hidden/inlined .iotlb_sync_map in s390_iommu_map_pages() or with an
enhanced .iotlb_sync_map that is able to return an error will fix this
shortcoming. The latter would be something like this commit I'm
currently testing with:

https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/commit/?h=dma_iommu_v3&id=a2aecd821fe3c5e2549946a68d8b07e05b288a9b

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-12-02 15:12               ` Niklas Schnelle
@ 2022-12-02 15:24                 ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2022-12-02 15:24 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On Fri, Dec 02, 2022 at 04:12:50PM +0100, Niklas Schnelle wrote:

> https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/commit/?h=dma_iommu_v3&id=a2aecd821fe3c5e2549946a68d8b07e05b288a9b

This patch makes sense to me; if the sync-map is optimally a
hypervisor call and the hypervisor is allowed to fail it, then
propagating the failure seems necessary.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-11-29 20:09                   ` Jason Gunthorpe
  2022-11-30  1:28                     ` Baolu Lu
@ 2022-12-05 15:34                     ` Niklas Schnelle
  2022-12-06 23:09                       ` Jason Gunthorpe
  1 sibling, 1 reply; 39+ messages in thread
From: Niklas Schnelle @ 2022-12-05 15:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Robin Murphy
  Cc: Baolu Lu, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On Tue, 2022-11-29 at 16:09 -0400, Jason Gunthorpe wrote:
> On Tue, Nov 29, 2022 at 06:41:22PM +0000, Robin Murphy wrote:
> > On 2022-11-29 17:33, Jason Gunthorpe wrote:
> > > On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:
> > > 
> > > > I'm hardly an advocate for trying to save users from themselves, but I
> > > > honestly can't see any justifiable reason for not having sysfs respect
> > > > iommu_get_def_domain_type().
> > > 
> > > We really need to rename this value if it is not actually just an
> > > advisory "default" but a functional requirement ..
> > 
> > It represents a required default domain type. As in, the type for the
> > device's default domain. Not the default type for a domain. It's the
> > iommu_def_domain_type variable that holds the *default* default domain type
> > ;)
> 
> I find the name "default domain" incredibly confusing at this point in
> time.

Yes, it definitely confused me, as evidenced by this patch.

> 
> I would like to call that the "dma-api domain" - its primary purpose
> is to be the domain that the DMA API uses to operate the IOMMU, there
> is little "default" about it. This meshes better with our apis talking
> about ownership and so forth.
> 
> So, if the op was called
>   get_dma_api_domain_type()
> 
> It is pretty clear that it is the exact type of domain that should be
> created to support the DMA API, which is what I think you have been
> describing it is supposed to do?
> 
> And with Lu's series we have the set_platform_dma() (Lu perhaps you
> should call this set_platform_dma_api() to re-enforce it is about the
> DMA API, not some nebulous DMA thing)
> 
> Which is basically the other way to configure the DMA API for
> operation.
> 
> And encapsulating more of the logic to setup and manage the DMA API's
> domain into dma-iommu.c would also be helpful to understanding.
> 
> > Which reminds me I should finish that patch undoing my terrible
> > ops->default_domain_ops idea, not least because they are misleadingly
> > unrelated to default domains...
> 
> :)
> 
> > > It is close to being clear, once we get the last touches of dma-iommu
> > > stuff out of the drivers it should be quite clear
> > 
> > Cool, some upheaval of .domain_alloc is next on my hitlist anyway, so that
> > might be a good excuse to upheave it a bit more and streamline the type
> > stuff along the way.
> 
> Yes, I think so. I want to tidy things a bit so adding this "user
> space" domain concept is a little nicer
> 
> Jason

Ok, so it sounds to me like there is going to be quite a bit of change
in this area, so I'm a little unsure how to proceed here. I think that,
especially with the proposed multiple queue types in this series, it
makes sense for an IOMMU driver to express its preference for a
particular domain type if it supports several, which clearly isn't what
iommu_get_def_domain_type() is currently supposed to do.

Looking at the current code I don't see a trivial way to integrate
this, especially with a lot of reworks going on.

At the moment the preference for IOMMU_DOMAIN_DMA_FQ over
IOMMU_DOMAIN_DMA during initial boot is hard coded in
iommu_subsys_init() before we even know which IOMMU driver will be
used, so we can't really access its ops there. On the other hand this
decision may still be rolled back if iommu_dma_init_fq() fails in
iommu_dma_init_domain(), so maybe it would make sense to assume a DMA
domain type of IOMMU_DOMAIN_DMA until that point and only then prefer
IOMMU_DOMAIN_DMA_FQ, or something else if the driver has a preference.

Maybe it would also make sense not to tie the type of flush queue used
to the domain type; the queue type could instead be passed in via
iommu_setup_dma_ops() -> iommu_dma_init_domain(), though I do remember
that there is also a planned rework for that. Any suggestions?
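
To make that more concrete, here is a purely hypothetical sketch of
what a driver-side preference could look like; the op name and the
helper are made up and nothing like them exists today:

/*
 * Hypothetical: an optional op the core could consult once the driver
 * is known, instead of hard coding the DMA vs. DMA-FQ preference in
 * iommu_subsys_init().
 */
static int s390_iommu_preferred_dma_domain_type(struct device *dev)
{
	/*
	 * s390_needs_iotlb_flush_on_map() is a placeholder for however
	 * the driver detects a hypervisor that shadows the translation
	 * tables and thus makes IOTLB flushes expensive.
	 */
	if (s390_needs_iotlb_flush_on_map(dev))
		return IOMMU_DOMAIN_DMA_SQ;	/* single queue proposed in this series */

	return IOMMU_DOMAIN_DMA_FQ;		/* per-CPU queues are fine on LPAR */
}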

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-12-02 14:29           ` Niklas Schnelle
  2022-12-02 14:42             ` Jason Gunthorpe
@ 2022-12-05 18:24             ` Robin Murphy
  2022-12-06 10:13               ` Niklas Schnelle
  1 sibling, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-12-05 18:24 UTC (permalink / raw)
  To: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On 2022-12-02 14:29, Niklas Schnelle wrote:
> On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
>> On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
>>> On 2022-11-29 12:00, Niklas Schnelle wrote:
>>>> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
>>>>> On 2022-11-16 17:16, Niklas Schnelle wrote:
>>>>>> When RPCIT indicates that the underlying hypervisor has run out of
>>>>>> resources it often means that its IOVA space is exhausted and IOVAs need
>>>>>> to be freed before new ones can be created. By triggering a flush of the
>>>>>> IOVA queue we can get the queued IOVAs freed and also get the new
>>>>>> mapping established during the global flush.
>>>>>
>>>>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
>>>>> exhausted and fail the DMA API call before even getting as far as
>>>>> iommu_map(), though? Or is there some less obvious limitation like a
>>>>> maximum total number of distinct IOVA regions regardless of size?
>>>>
>>>> Well, yes and no. Your thinking is of course correct: if the advertised
>>>> available IOVA space can be fully utilized without exhausting
>>>> hypervisor resources, we won't trigger this case. However sadly there
>>>> are complications. The most obvious being that in QEMU/KVM the
>>>> restriction of the IOVA space to what QEMU can actually have mapped at
>>>> once was just added recently[0] prior to that we would regularly go
>>>> through this "I'm out of resources free me some IOVAs" dance with our
>>>> existing DMA API implementation where this just triggers an early cycle
>>>> of freeing all unused IOVAs followed by a global flush. On z/VM I know
>>>> of no situations where this is triggered. That said this signalling is
>>>> architected so z/VM may have corner cases where it does this. On our
>>>> bare metal hypervisor (no paging) this return code is unused and IOTLB
>>>> flushes are simply hardware cache flushes as on bare metal platforms.
>>>>
>>>> [0]
>>>> https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/
>>>
>>> That sheds a bit more light, thanks, although I'm still not confident I
>>> fully understand the whole setup. AFAICS that patch looks to me like
>>> it's putting a fixed limit on the size of the usable address space. That
>>> in turn implies that "free some IOVAs and try again" might be a red
>>> herring and never going to work; for your current implementation, what
>>> that presumably means in reality is "free some IOVAs, resetting the
>>> allocator to start allocating lower down in the address space where it
>>> will happen to be below that limit, and try again", but the iommu-dma
>>> allocator won't do that. If it doesn't know that some arbitrary range
>>> below the top of the driver-advertised aperture is unusable, it will
>>> just keep allocating IOVAs up there and mappings will always fail.
>>>
>>> If the driver can't accurately represent the usable IOVA space via the
>>> aperture and/or reserved regions, then this whole approach seems doomed.
>>> If on the other hand I've misunderstood and you can actually still use
>>> any address, just not all of them at the same time,
>>
>>
>> This is exactly it, the problem is a limit on the number of IOVAs that
>> are concurrently mapped. In QEMU pass-through the tightest limit is
>> usually the one set by the host kernel parameter
>> vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With
>> IOMMU_DOMAIN_DMA we stay under this limit without extra action but once
>> there is a flush queue (including the existing per-CPU one) where each
>> entry may keep many pages lazily unmapped this is easily hit with fio
>> bandwidth tests on an NVMe. For this case this patch works reliably
>> because of course the number of actually active mappings without the
>> lazily freed ones is similar to the number of active ones with
>> IOMMU_DOMAIN_DMA.
>>
>>>   then it might in
>>> fact be considerably easier to skip the flush queue mechanism entirely
>>> and implement this internally to the driver - basically make .iotlb_sync
>>> a no-op for non-strict DMA domains,
>>
>> I'm assuming you mean .iotlb_sync_map above.

No, I did mean .iotlb_sync, however on reflection that was under the 
assumption that it's OK for the hypervisor to see a new mapping for a 
previously-used IOVA without having seen it explicitly unmapped in 
between. Fair enough if that isn't the case, but if it is then your 
pagetable can essentially act as the "flush queue" by itself.

>>>   put the corresponding RPCIT flush
>>> and retry in .sync_map, then allow that to propagate an error back to
>>> iommu_map() if the new mapping still hasn't taken.
>>>
>>> Thanks,
>>> Robin.
>>
>> Hmm, interesting. This would leave the IOVAs in the flush queue lazily
>> unmapped and thus still block their re-use but free their host
>> resources via a global RPCIT allowing the guest to use a different
>> portion of the IOVA space with those resources. It could work, though
>> I need to test it, but it feels a bit clunky.
>>
>> Maybe we can go cleaner while using this idea of not having to flush
>> the queue but just freeing their host side resources. If we allowed
>> .iotlb_sync_map to return an error that fails the mapping operation,
>> then we could do it all in there. In the normal case it just does the
>> RPCIT but if that returns that the hypervisor ran out of resources it
>> does another global RPCIT allowing the hypervisor to free IOVAs that
>> were lazily unmapped. If the latter succeeds all is good if not then
>> the mapping operation failed. Logically it makes sense too,
>> .iotlb_sync_map is the final step of syncing the mapping to the host
>> which can fail just like the mapping operation itself.
>>
>> Apart from the out of active IOVAs case this would also handle the
>> other useful error case when using .iotlb_sync_map for shadowing where
>> it fails because the host ran against a pinned pages limit or out of
>> actual memory. Not by fixing it but at least we would get a failed
>> mapping operation.
>>
>> The other callbacks .flush_iotlb_all and .iotlb_sync
>> could stay the same as they are only used for unmapped pages where we
>> can't reasonably run out of resources in the host neither active IOVAs
>> nor pinned pages.
>>
> 
> Ok, I've done some testing with the above idea and this seems to work
> great. I've verified that my version of QEMU (without Matt's IOVA
> aperture restriction patch) creates the RPCIT out-of-resource
> indications and then the global flush in .iotlb_sync_map is triggered
> and allows QEMU to unpin pages and free IOVAs while the guest still has
> them lazily unmapped (sitting in the flush queue) and thus uses
> different IOVAs.
> 
> @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> can return an error and then fail the mapping operation, I think
> this is a great approach. One advantage over the previous approach of
> flushing the queue is that this should work for the pure IOMMU API too.

Whatever happens I think allowing .iotlb_sync_map to propagate an error 
out through iommu_map() is an appropriate thing to do - it sounds like 
s390 might technically need that for regular IOMMU API correctness in 
some circumstances anyway. Besides, even in the cases where it 
represents "simple" TLB maintenance, there are potentially ways that 
could fail (like timing out if the IOMMU has gone completely wrong), so 
it wouldn't seem entirely unreasonable if a driver might want to report 
overall failure if it can't guarantee that the new mapping will actually 
be usable.

Thanks,
Robin.

> If you don't want to change the signature of .iotlb_sync_map I think we
> can do Robin's idea and have .iotlb_sync_map as a no-op and do the
> RPCIT sync as part of the s390_iommu_map_pages(). This would hide what
> really is our variant of .iotlb_sync_map in the mapping code though
> which I don't super like. Besides that it would also cause more RPCITs
> in __iommu_map_sg() as we could no longer use a single RPCIT for the
> entire range.
> 
> Thanks,
> Niklas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
  2022-12-05 18:24             ` Robin Murphy
@ 2022-12-06 10:13               ` Niklas Schnelle
  0 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-12-06 10:13 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Mon, 2022-12-05 at 18:24 +0000, Robin Murphy wrote:
> On 2022-12-02 14:29, Niklas Schnelle wrote:
> > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
> > > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> > > > On 2022-11-29 12:00, Niklas Schnelle wrote:
> > > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > > > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > > > > to be freed before new ones can be created. By triggering a flush of the
> > > > > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > > > > mapping established during the global flush.
> > > > > > 
> > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > > > > exhausted and fail the DMA API call before even getting as far as
> > > > > > iommu_map(), though? Or is there some less obvious limitation like a
> > > > > > maximum total number of distinct IOVA regions regardless of size?
> > > > > 
> > > > > Well, yes and no. Your thinking is of course correct: if the advertised
> > > > > available IOVA space can be fully utilized without exhausting
> > > > > hypervisor resources, we won't trigger this case. However sadly there
> > > > > are complications. The most obvious being that in QEMU/KVM the
> > > > > restriction of the IOVA space to what QEMU can actually have mapped at
> > > > > once was just added recently[0] prior to that we would regularly go
> > > > > through this "I'm out of resources free me some IOVAs" dance with our
> > > > > existing DMA API implementation where this just triggers an early cycle
> > > > > of freeing all unused IOVAs followed by a global flush. On z/VM I know
> > > > > of no situations where this is triggered. That said this signalling is
> > > > > architected so z/VM may have corner cases where it does this. On our
> > > > > bare metal hypervisor (no paging) this return code is unused and IOTLB
> > > > > flushes are simply hardware cache flushes as on bare metal platforms.
> > > > > 
> > > > > [0]
> > > > > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@linux.ibm.com/
> > > > 
> > > > That sheds a bit more light, thanks, although I'm still not confident I
> > > > fully understand the whole setup. AFAICS that patch looks to me like
> > > > it's putting a fixed limit on the size of the usable address space. That
> > > > in turn implies that "free some IOVAs and try again" might be a red
> > > > herring and never going to work; for your current implementation, what
> > > > that presumably means in reality is "free some IOVAs, resetting the
> > > > allocator to start allocating lower down in the address space where it
> > > > will happen to be below that limit, and try again", but the iommu-dma
> > > > allocator won't do that. If it doesn't know that some arbitrary range
> > > > below the top of the driver-advertised aperture is unusable, it will
> > > > just keep allocating IOVAs up there and mappings will always fail.
> > > > 
> > > > If the driver can't accurately represent the usable IOVA space via the
> > > > aperture and/or reserved regions, then this whole approach seems doomed.
> > > > If on the other hand I've misunderstood and you can actually still use
> > > > any address, just not all of them at the same time,
> > > 
> > > 
> > > This is exactly it, the problem is a limit on the number of IOVAs that
> > > are concurrently mapped. In QEMU pass-through the tightest limit is
> > > usually the one set by the host kernel parameter
> > > vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With
> > > IOMMU_DOMAIN_DMA we stay under this limit without extra action but once
> > > there is a flush queue (including the existing per-CPU one) where each
> > > entry may keep many pages lazily unmapped this is easily hit with fio
> > > bandwidth tests on an NVMe. For this case this patch works reliably
> > > because of course the number of actually active mappings without the
> > > lazily freed ones is similar to the number of active ones with
> > > IOMMU_DOMAIN_DMA.
> > > 
> > > >   then it might in
> > > > fact be considerably easier to skip the flush queue mechanism entirely
> > > > and implement this internally to the driver - basically make .iotlb_sync
> > > > a no-op for non-strict DMA domains,
> > > 
> > > I'm assuming you mean .iotlb_sync_map above.
> 
> No, I did mean .iotlb_sync, however on reflection that was under the 
> assumption that it's OK for the hypervisor to see a new mapping for a 
> previously-used IOVA without having seen it explicitly unmapped in 
> between. Fair enough if that isn't the case, but if it is then your 
> pagetable can essentially act as the "flush queue" by itself.

Hmm, I see. We do want the hypervisor to see mappings as invalid before
they are changed again to a new valid mapping. In the case of e.g.
QEMU/KVM this allows the hypervisor itself to rely on the hardware to
fill in the IOTLB on map, i.e. to use a no-op .iotlb_sync_map and an
.iotlb_sync that flushes the hardware IOTLB. It also allows QEMU to
emulate the IOMMU with just map/unmap primitives on vfio/iommufd.
Because of this, vfio-pci pass-through even works in combination with
an emulated s390 guest on an x86_64 host, as I recently found, albeit
very slowly of course.

> 
> > > >   put the corresponding RPCIT flush
> > > > and retry in .sync_map, then allow that to propagate an error back to
> > > > iommu_map() if the new mapping still hasn't taken.
> > > > 
> > > > Thanks,
> > > > Robin.
> > > 
> > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily
> > > unmapped and thus still block their re-use but free their host
> > > resources via a global RPCIT allowing the guest to use a different
> > > portion of the IOVA space with those resources. It could work, though
> > > I need to test it, but it feels a bit clunky.
> > > 
> > > Maybe we can go cleaner while using this idea of not having to flush
> > > the queue but just freeing their host side resources. If we allowed
> > > .iotlb_sync_map to return an error that fails the mapping operation,
> > > then we could do it all in there. In the normal case it just does the
> > > RPCIT but if that returns that the hypervisor ran out of resources it
> > > does another global RPCIT allowing the hypervisor to free IOVAs that
> > > were lazily unmapped. If the latter succeeds all is good if not then
> > > the mapping operation failed. Logically it makes sense too,
> > > .iotlb_sync_map is the final step of syncing the mapping to the host
> > > which can fail just like the mapping operation itself.
> > > 
> > > Apart from the out of active IOVAs case this would also handle the
> > > other useful error case when using .iotlb_sync_map for shadowing where
> > > it fails because the host ran against a pinned pages limit or out of
> > > actual memory. Not by fixing it but at least we would get a failed
> > > mapping operation.
> > > 
> > > The other callbacks .flush_iotlb_all and .iotlb_sync
> > > could stay the same as they are only used for unmapped pages where we
> > > can't reasonably run out of resources in the host neither active IOVAs
> > > nor pinned pages.
> > > 
> > 
> > Ok, I've done some testing with the above idea and this seems to work
> > great. I've verified that my version of QEMU (without Matt's IOVA
> > aperture restriction patch) creates the RPCIT out-of-resource
> > indications and then the global flush in .iotlb_sync_map is triggered
> > and allows QEMU to unpin pages and free IOVAs while the guest still has
> > them lazily unmapped (sitting in the flush queue) and thus uses
> > different IOVAs.
> > 
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
> > can return an error and then fail the mapping operation, I think
> > this is a great approach. One advantage over the previous approach of
> > flushing the queue is that this should work for the pure IOMMU API too.
> 
> Whatever happens I think allowing .iotlb_sync_map to propagate an error 
> out through iommu_map() is an appropriate thing to do - it sounds like 
> s390 might technically need that for regular IOMMU API correctness in 
> some circumstances anyway. Besides, even in the cases where it 
> represents "simple" TLB maintenance, there are potentially ways that 
> could fail (like timing out if the IOMMU has gone completely wrong), so 
> it wouldn't seem entirely unreasonable if a driver might want to report 
> overall failure if it can't guarantee that the new mapping will actually 
> be usable.
> 
> Thanks,
> Robin.
> 

Great, then I'll use this approach for v3, or maybe even send this
separately as a prerequisite IOMMU fix, as this is also required for
the IOMMU API to work in a guest where the hypervisor may report
running out of resources.
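
In rough pseudo-C the .iotlb_sync_map behaviour described above would
look something like the following; the helper names and the -ENOSPC
check are placeholders for the real RPCIT code and out-of-resource
indication, and it assumes the callback may return an error:

static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
				     unsigned long iova, size_t size)
{
	struct s390_domain *s390_domain = to_s390_domain(domain);
	int ret;

	/* sync the new mapping to the hypervisor's shadow table */
	ret = s390_rpcit_range(s390_domain, iova, size);
	if (ret != -ENOSPC)
		return ret;

	/*
	 * Hypervisor ran out of resources: a global RPCIT lets it drop
	 * shadow entries and pinned pages for IOVAs the guest has lazily
	 * unmapped, then the new range is retried once.
	 */
	ret = s390_rpcit_global(s390_domain);
	if (ret)
		return ret;

	return s390_rpcit_range(s390_domain, iova, size);
}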

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-12-05 15:34                     ` Niklas Schnelle
@ 2022-12-06 23:09                       ` Jason Gunthorpe
  2022-12-07 13:18                         ` Baolu Lu
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-12-06 23:09 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Robin Murphy, Baolu Lu, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Mon, Dec 05, 2022 at 04:34:08PM +0100, Niklas Schnelle wrote:
> On Tue, 2022-11-29 at 16:09 -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 29, 2022 at 06:41:22PM +0000, Robin Murphy wrote:
> > > On 2022-11-29 17:33, Jason Gunthorpe wrote:
> > > > On Mon, Nov 28, 2022 at 09:01:43PM +0000, Robin Murphy wrote:
> > > > 
> > > > > I'm hardly an advocate for trying to save users from themselves, but I
> > > > > honestly can't see any justifiable reason for not having sysfs respect
> > > > > iommu_get_def_domain_type().
> > > > 
> > > > We really need to rename this value if it is not actually just an
> > > > advisory "default" but a functional requirement ..
> > > 
> > > It represents a required default domain type. As in, the type for the
> > > device's default domain. Not the default type for a domain. It's the
> > > iommu_def_domain_type variable that holds the *default* default domain type
> > > ;)
> > 
> > I find the name "default domain" incredibly confusing at this point in
> > time.
> 
> Yes it definitely confused me as evidenced by this patch.

I did some fiddling with what this rename could look like and came up
with this very sketchy sketch. I think it looks a lot clearer in the
end, but it seems a bit tricky to break things up into nice patches.

Jason

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index beb9f535d3d815..052caf21fee3dd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -34,7 +34,12 @@
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
 
-static unsigned int iommu_def_domain_type __read_mostly;
+static enum dma_api_policy {
+	DMA_API_POLICY_IDENTITY,
+	DMA_API_POLICY_STRICT,
+	DMA_API_POLICY_LAZY,
+	DMA_API_POLICY_ANY,
+} dma_api_default_policy __read_mostly;
 static bool iommu_dma_strict __read_mostly = IS_ENABLED(CONFIG_IOMMU_DEFAULT_DMA_STRICT);
 static u32 iommu_cmd_line __read_mostly;
 
@@ -82,7 +87,7 @@ static const char * const iommu_group_resv_type_string[] = {
 static int iommu_bus_notifier(struct notifier_block *nb,
 			      unsigned long action, void *data);
 static int iommu_alloc_default_domain(struct iommu_group *group,
-				      struct device *dev);
+				      enum dma_api_policy policy);
 static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus,
 						 unsigned type);
 static int __iommu_attach_device(struct iommu_domain *domain,
@@ -97,6 +102,9 @@ static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
 
+static enum dma_api_policy
+iommu_get_dma_api_mandatory_policy(struct device *dev);
+
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
 	__ATTR(_name, _mode, _show, _store)
@@ -125,26 +133,6 @@ static struct bus_type * const iommu_buses[] = {
 #endif
 };
 
-/*
- * Use a function instead of an array here because the domain-type is a
- * bit-field, so an array would waste memory.
- */
-static const char *iommu_domain_type_str(unsigned int t)
-{
-	switch (t) {
-	case IOMMU_DOMAIN_BLOCKED:
-		return "Blocked";
-	case IOMMU_DOMAIN_IDENTITY:
-		return "Passthrough";
-	case IOMMU_DOMAIN_UNMANAGED:
-		return "Unmanaged";
-	case IOMMU_DOMAIN_DMA:
-		return "Translated";
-	default:
-		return "Unknown";
-	}
-}
-
 static int __init iommu_subsys_init(void)
 {
 	struct notifier_block *nb;
@@ -161,19 +149,27 @@ static int __init iommu_subsys_init(void)
 		}
 	}
 
-	if (!iommu_default_passthrough() && !iommu_dma_strict)
-		iommu_def_domain_type = IOMMU_DOMAIN_DMA_FQ;
+	if (dma_api_default_policy == DMA_API_POLICY_LAZY && iommu_dma_strict)
+		dma_api_default_policy = DMA_API_POLICY_STRICT;
 
-	pr_info("Default domain type: %s %s\n",
-		iommu_domain_type_str(iommu_def_domain_type),
-		(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
-			"(set via kernel command line)" : "");
-
-	if (!iommu_default_passthrough())
-		pr_info("DMA domain TLB invalidation policy: %s mode %s\n",
+	switch (dma_api_default_policy) {
+	case DMA_API_POLICY_LAZY:
+	case DMA_API_POLICY_STRICT:
+		pr_info("DMA translated domain TLB invalidation policy: %s mode %s\n",
 			iommu_dma_strict ? "strict" : "lazy",
 			(iommu_cmd_line & IOMMU_CMD_LINE_STRICT) ?
-				"(set via kernel command line)" : "");
+				"(set via kernel command line)" :
+				"");
+		break;
+	case DMA_API_POLICY_IDENTITY:
+		pr_info("Default domain type: Passthrough %s\n",
+			(iommu_cmd_line & IOMMU_CMD_LINE_DMA_API) ?
+				"(set via kernel command line)" :
+				"");
+		break;
+	default:
+		break;
+	}
 
 	nb = kcalloc(ARRAY_SIZE(iommu_buses), sizeof(*nb), GFP_KERNEL);
 	if (!nb)
@@ -353,7 +349,8 @@ int iommu_probe_device(struct device *dev)
 	 * checked.
 	 */
 	mutex_lock(&group->mutex);
-	iommu_alloc_default_domain(group, dev);
+	iommu_alloc_default_domain(group,
+				   iommu_get_dma_api_mandatory_policy(dev));
 
 	/*
 	 * If device joined an existing group which has been claimed, don't
@@ -436,8 +433,8 @@ early_param("iommu.strict", iommu_dma_setup);
 void iommu_set_dma_strict(void)
 {
 	iommu_dma_strict = true;
-	if (iommu_def_domain_type == IOMMU_DOMAIN_DMA_FQ)
-		iommu_def_domain_type = IOMMU_DOMAIN_DMA;
+	if (dma_api_default_policy == DMA_API_POLICY_LAZY)
+		dma_api_default_policy = DMA_API_POLICY_STRICT;
 }
 
 static ssize_t iommu_group_attr_show(struct kobject *kobj,
@@ -1557,53 +1554,100 @@ struct iommu_group *fsl_mc_device_group(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
-static int iommu_get_def_domain_type(struct device *dev)
+static enum dma_api_policy
+iommu_get_dma_api_mandatory_policy(struct device *dev)
 {
 	const struct iommu_ops *ops = dev_iommu_ops(dev);
 
 	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
-		return IOMMU_DOMAIN_DMA;
+		return DMA_API_POLICY_STRICT;
 
 	if (ops->def_domain_type)
 		return ops->def_domain_type(dev);
-
-	return 0;
+	return DMA_API_POLICY_ANY;
 }
 
-static int iommu_group_alloc_default_domain(struct bus_type *bus,
-					    struct iommu_group *group,
-					    unsigned int type)
+static int __iommu_get_dma_api_group_mandatory_policy(struct device *dev,
+						      void *data)
 {
-	struct iommu_domain *dom;
+	enum dma_api_policy *policy = data;
+	enum dma_api_policy dev_policy =
+		iommu_get_dma_api_mandatory_policy(dev);
 
-	dom = __iommu_domain_alloc(bus, type);
-	if (!dom && type != IOMMU_DOMAIN_DMA) {
-		dom = __iommu_domain_alloc(bus, IOMMU_DOMAIN_DMA);
-		if (dom)
-			pr_warn("Failed to allocate default IOMMU domain of type %u for group %s - Falling back to IOMMU_DOMAIN_DMA",
-				type, group->name);
+	if (dev_policy == DMA_API_POLICY_ANY || *policy == dev_policy)
+		return 0;
+	if (*policy == DMA_API_POLICY_ANY) {
+		*policy = dev_policy;
+		return 0;
 	}
 
-	if (!dom)
-		return -ENOMEM;
+	dev_warn(
+		dev,
+		"IOMMU driver is requesting different DMA API policies for devices in the same group %s.",
+		dev->iommu_group->name);
 
-	group->default_domain = dom;
-	if (!group->domain)
-		group->domain = dom;
+	/*
+	 * Resolve the conflict by priority, the lower numbers in the enum are
+	 * higher preference in case of driver problems.
+	 */
+	if (dev_policy < *policy)
+		*policy = dev_policy;
 	return 0;
 }
 
+static enum dma_api_policy
+iommu_get_dma_api_group_mandatory_policy(struct iommu_group *group)
+{
+	enum dma_api_policy policy = DMA_API_POLICY_ANY;
+
+	__iommu_group_for_each_dev(group, &policy,
+				   __iommu_get_dma_api_group_mandatory_policy);
+	return policy;
+}
+
 static int iommu_alloc_default_domain(struct iommu_group *group,
-				      struct device *dev)
+				      enum dma_api_policy policy)
 {
-	unsigned int type;
+	struct iommu_domain *domain;
+	struct group_device *grp_dev =
+		list_first_entry(&group->devices, struct group_device, list);
+	struct bus_type *bus = grp_dev->dev->bus;
+
+	lockdep_assert_held(&group->mutex);
 
 	if (group->default_domain)
 		return 0;
+	if (policy == DMA_API_POLICY_ANY)
+		policy = dma_api_default_policy;
+	switch (policy) {
+		case DMA_API_POLICY_IDENTITY:
+		domain = __iommu_domain_alloc(bus, IOMMU_DOMAIN_IDENTITY);
+		break;
+
+		case DMA_API_POLICY_LAZY:
+		domain = __iommu_domain_alloc(bus, IOMMU_DOMAIN_DMA);
+		if (domain && !domain->ops->prefer_dma_api_fq){
+			pr_warn("Failed to allocate default lazy IOMMU domain for group %s - Falling back to strict",
+				group->name);
+		}
+		break;
 
-	type = iommu_get_def_domain_type(dev) ? : iommu_def_domain_type;
+		case DMA_API_POLICY_STRICT:
+		domain = __iommu_domain_alloc(bus, IOMMU_DOMAIN_DMA);
+		break;
 
-	return iommu_group_alloc_default_domain(dev->bus, group, type);
+		default:
+		WARN_ON(true);
+		domain = NULL;
+	}
+
+	if (!domain)
+		return -ENOMEM;
+
+	group->default_domain = domain;
+	if (!group->domain)
+		group->domain = domain;
+	return 0;
 }
 
 /**
@@ -1688,52 +1732,6 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 	return 0;
 }
 
-struct __group_domain_type {
-	struct device *dev;
-	unsigned int type;
-};
-
-static int probe_get_default_domain_type(struct device *dev, void *data)
-{
-	struct __group_domain_type *gtype = data;
-	unsigned int type = iommu_get_def_domain_type(dev);
-
-	if (type) {
-		if (gtype->type && gtype->type != type) {
-			dev_warn(dev, "Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
-				 iommu_domain_type_str(type),
-				 dev_name(gtype->dev),
-				 iommu_domain_type_str(gtype->type));
-			gtype->type = 0;
-		}
-
-		if (!gtype->dev) {
-			gtype->dev  = dev;
-			gtype->type = type;
-		}
-	}
-
-	return 0;
-}
-
-static void probe_alloc_default_domain(struct bus_type *bus,
-				       struct iommu_group *group)
-{
-	struct __group_domain_type gtype;
-
-	memset(&gtype, 0, sizeof(gtype));
-
-	/* Ask for default domain requirements of all devices in the group */
-	__iommu_group_for_each_dev(group, &gtype,
-				   probe_get_default_domain_type);
-
-	if (!gtype.type)
-		gtype.type = iommu_def_domain_type;
-
-	iommu_group_alloc_default_domain(bus, group, gtype.type);
-
-}
-
 static int iommu_group_do_dma_attach(struct device *dev, void *data)
 {
 	struct iommu_domain *domain = data;
@@ -1804,7 +1802,8 @@ int bus_iommu_probe(struct bus_type *bus)
 		mutex_lock(&group->mutex);
 
 		/* Try to allocate default domain */
-		probe_alloc_default_domain(bus, group);
+		iommu_alloc_default_domain(
+			group, iommu_get_dma_api_group_mandatory_policy(group));
 
 		if (!group->default_domain) {
 			mutex_unlock(&group->mutex);
@@ -2600,19 +2599,19 @@ void iommu_set_default_passthrough(bool cmd_line)
 {
 	if (cmd_line)
 		iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
-	iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
+	dma_api_default_policy = DMA_API_POLICY_IDENTITY;
 }
 
 void iommu_set_default_translated(bool cmd_line)
 {
 	if (cmd_line)
 		iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
-	iommu_def_domain_type = IOMMU_DOMAIN_DMA;
+	dma_api_default_policy = DMA_API_POLICY_STRICT;
 }
 
 bool iommu_default_passthrough(void)
 {
-	return iommu_def_domain_type == IOMMU_DOMAIN_IDENTITY;
+	return dma_api_default_policy == DMA_API_POLICY_IDENTITY;
 }
 EXPORT_SYMBOL_GPL(iommu_default_passthrough);
 
@@ -2832,103 +2831,40 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid);
  *    group's default domain type through /sys/kernel/iommu_groups/<grp_id>/type
  *    Please take a closer look if intended to use for other purposes.
  */
-static int iommu_change_dev_def_domain(struct iommu_group *group,
-				       struct device *prev_dev, int type)
+static int iommu_change_group_dma_api_policy(struct iommu_group *group,
+					     enum dma_api_policy policy)
 {
-	struct iommu_domain *prev_dom;
-	struct group_device *grp_dev;
-	int ret, dev_def_dom;
-	struct device *dev;
-
-	mutex_lock(&group->mutex);
-
-	if (group->default_domain != group->domain) {
-		dev_err_ratelimited(prev_dev, "Group not assigned to default domain\n");
-		ret = -EBUSY;
-		goto out;
-	}
-
-	/*
-	 * iommu group wasn't locked while acquiring device lock in
-	 * iommu_group_store_type(). So, make sure that the device count hasn't
-	 * changed while acquiring device lock.
-	 *
-	 * Changing default domain of an iommu group with two or more devices
-	 * isn't supported because there could be a potential deadlock. Consider
-	 * the following scenario. T1 is trying to acquire device locks of all
-	 * the devices in the group and before it could acquire all of them,
-	 * there could be another thread T2 (from different sub-system and use
-	 * case) that has already acquired some of the device locks and might be
-	 * waiting for T1 to release other device locks.
-	 */
-	if (iommu_group_device_count(group) != 1) {
-		dev_err_ratelimited(prev_dev, "Cannot change default domain: Group has more than one device\n");
-		ret = -EINVAL;
-		goto out;
-	}
-
-	/* Since group has only one device */
-	grp_dev = list_first_entry(&group->devices, struct group_device, list);
-	dev = grp_dev->dev;
-
-	if (prev_dev != dev) {
-		dev_err_ratelimited(prev_dev, "Cannot change default domain: Device has been changed\n");
-		ret = -EBUSY;
-		goto out;
-	}
+	enum dma_api_policy mandatory =
+		iommu_get_dma_api_group_mandatory_policy(group);
+	struct iommu_domain *old_domain;
+	int ret;
 
-	prev_dom = group->default_domain;
-	if (!prev_dom) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (mandatory != DMA_API_POLICY_ANY && policy != mandatory)
+		return -EINVAL;
 
-	dev_def_dom = iommu_get_def_domain_type(dev);
-	if (!type) {
-		/*
-		 * If the user hasn't requested any specific type of domain and
-		 * if the device supports both the domains, then default to the
-		 * domain the device was booted with
-		 */
-		type = dev_def_dom ? : iommu_def_domain_type;
-	} else if (dev_def_dom && type != dev_def_dom) {
-		dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n",
-				    iommu_domain_type_str(type));
-		ret = -EINVAL;
-		goto out;
-	}
+	mutex_lock(&group->mutex);
 
-	/*
-	 * Switch to a new domain only if the requested domain type is different
-	 * from the existing default domain type
-	 */
-	if (prev_dom->type == type) {
-		ret = 0;
-		goto out;
+	/* Shortcut if FQ is being enabled */
+	if (group->default_domain->type == IOMMU_DOMAIN_DMA &&
+	    policy == DMA_API_POLICY_LAZY) {
+		ret = iommu_dma_init_fq(group->default_domain);
+		if (!ret) {
+			group->default_domain->type = IOMMU_DOMAIN_DMA_FQ;
+			mutex_unlock(&group->mutex);
+			return 0;
+		}
 	}
 
-	/* We can bring up a flush queue without tearing down the domain */
-	if (type == IOMMU_DOMAIN_DMA_FQ && prev_dom->type == IOMMU_DOMAIN_DMA) {
-		ret = iommu_dma_init_fq(prev_dom);
-		if (!ret)
-			prev_dom->type = IOMMU_DOMAIN_DMA_FQ;
-		goto out;
+	/* Otherwise just replace the whole domain */
+	old_domain = group->default_domain;
+	group->default_domain = NULL;
+	ret = iommu_alloc_default_domain(group, policy);
+	if (ret) {
+		group->default_domain = old_domain;
+		mutex_unlock(&group->mutex);
+		return ret;
 	}
-
-	/* Sets group->default_domain to the newly allocated domain */
-	ret = iommu_group_alloc_default_domain(dev->bus, group, type);
-	if (ret)
-		goto out;
-
-	ret = iommu_create_device_direct_mappings(group, dev);
-	if (ret)
-		goto free_new_domain;
-
-	ret = __iommu_attach_device(group->default_domain, dev);
-	if (ret)
-		goto free_new_domain;
-
-	group->domain = group->default_domain;
+	iommu_group_create_direct_mappings(group);
 
 	/*
 	 * Release the mutex here because ops->probe_finalize() call-back of
@@ -2938,20 +2874,15 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 	 */
 	mutex_unlock(&group->mutex);
 
+	/*
+	 * FIXME: How does iommu_setup_dma_ops() get called with the new domain
+	 * on ARM?
+	 */
+
 	/* Make sure dma_ops is appropriatley set */
-	iommu_group_do_probe_finalize(dev, group->default_domain);
-	iommu_domain_free(prev_dom);
+	__iommu_group_dma_finalize(group);
+	iommu_domain_free(old_domain);
 	return 0;
-
-free_new_domain:
-	iommu_domain_free(group->default_domain);
-	group->default_domain = prev_dom;
-	group->domain = prev_dom;
-
-out:
-	mutex_unlock(&group->mutex);
-
-	return ret;
 }
 
 /*
@@ -2966,9 +2897,8 @@ static int iommu_change_dev_def_domain(struct iommu_group *group,
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count)
 {
-	struct group_device *grp_dev;
-	struct device *dev;
-	int ret, req_type;
+	enum dma_api_policy policy;
+	int ret;
 
 	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
 		return -EACCES;
@@ -2977,77 +2907,30 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 		return -EINVAL;
 
 	if (sysfs_streq(buf, "identity"))
-		req_type = IOMMU_DOMAIN_IDENTITY;
+		policy = DMA_API_POLICY_IDENTITY;
 	else if (sysfs_streq(buf, "DMA"))
-		req_type = IOMMU_DOMAIN_DMA;
+		policy = DMA_API_POLICY_STRICT;
 	else if (sysfs_streq(buf, "DMA-FQ"))
-		req_type = IOMMU_DOMAIN_DMA_FQ;
+		policy = DMA_API_POLICY_LAZY;
 	else if (sysfs_streq(buf, "auto"))
-		req_type = 0;
+		policy = DMA_API_POLICY_ANY;
 	else
 		return -EINVAL;
 
 	/*
-	 * Lock/Unlock the group mutex here before device lock to
-	 * 1. Make sure that the iommu group has only one device (this is a
-	 *    prerequisite for step 2)
-	 * 2. Get struct *dev which is needed to lock device
-	 */
-	mutex_lock(&group->mutex);
-	if (iommu_group_device_count(group) != 1) {
-		mutex_unlock(&group->mutex);
-		pr_err_ratelimited("Cannot change default domain: Group has more than one device\n");
-		return -EINVAL;
-	}
-
-	/* Since group has only one device */
-	grp_dev = list_first_entry(&group->devices, struct group_device, list);
-	dev = grp_dev->dev;
-	get_device(dev);
-
-	/*
-	 * Don't hold the group mutex because taking group mutex first and then
-	 * the device lock could potentially cause a deadlock as below. Assume
-	 * two threads T1 and T2. T1 is trying to change default domain of an
-	 * iommu group and T2 is trying to hot unplug a device or release [1] VF
-	 * of a PCIe device which is in the same iommu group. T1 takes group
-	 * mutex and before it could take device lock assume T2 has taken device
-	 * lock and is yet to take group mutex. Now, both the threads will be
-	 * waiting for the other thread to release lock. Below, lock order was
-	 * suggested.
-	 * device_lock(dev);
-	 *	mutex_lock(&group->mutex);
-	 *		iommu_change_dev_def_domain();
-	 *	mutex_unlock(&group->mutex);
-	 * device_unlock(dev);
-	 *
-	 * [1] Typical device release path
-	 * device_lock() from device/driver core code
-	 *  -> bus_notifier()
-	 *   -> iommu_bus_notifier()
-	 *    -> iommu_release_device()
-	 *     -> ops->release_device() vendor driver calls back iommu core code
-	 *      -> mutex_lock() from iommu core code
+	 * Taking ownership disables the DMA API domain, prevents drivers from
+	 * being attached, and switches to a blocking domain. Releasing
+	 * ownership will put back the new or original DMA API domain.
 	 */
-	mutex_unlock(&group->mutex);
-
-	/* Check if the device in the group still has a driver bound to it */
-	device_lock(dev);
-	if (device_is_bound(dev) && !(req_type == IOMMU_DOMAIN_DMA_FQ &&
-	    group->default_domain->type == IOMMU_DOMAIN_DMA)) {
-		pr_err_ratelimited("Device is still bound to driver\n");
-		ret = -EBUSY;
-		goto out;
-	}
-
-	ret = iommu_change_dev_def_domain(group, dev, req_type);
-	ret = ret ?: count;
-
-out:
-	device_unlock(dev);
-	put_device(dev);
+	ret = iommu_group_claim_dma_owner(group, &ret);
+	if (ret)
+		return ret;
 
-	return ret;
+	ret = iommu_change_group_dma_api_policy(group, policy);
+	iommu_group_release_dma_owner(group);
+	if (ret)
+		return ret;
+	return count;
 }
 
 static bool iommu_is_default_domain(struct iommu_group *group)

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-12-06 23:09                       ` Jason Gunthorpe
@ 2022-12-07 13:18                         ` Baolu Lu
  2022-12-07 13:23                           ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Baolu Lu @ 2022-12-07 13:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Niklas Schnelle
  Cc: baolu.lu, Robin Murphy, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On 2022/12/7 7:09, Jason Gunthorpe wrote:
>   static ssize_t iommu_group_store_type(struct iommu_group *group,
>   				      const char *buf, size_t count)
>   {
> -	struct group_device *grp_dev;
> -	struct device *dev;
> -	int ret, req_type;
> +	enum dma_api_policy policy;
> +	int ret;
>   
>   	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
>   		return -EACCES;
> @@ -2977,77 +2907,30 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
>   		return -EINVAL;
>   
>   	if (sysfs_streq(buf, "identity"))
> -		req_type = IOMMU_DOMAIN_IDENTITY;
> +		policy = DMA_API_POLICY_IDENTITY;
>   	else if (sysfs_streq(buf, "DMA"))
> -		req_type = IOMMU_DOMAIN_DMA;
> +		policy = DMA_API_POLICY_STRICT;
>   	else if (sysfs_streq(buf, "DMA-FQ"))
> -		req_type = IOMMU_DOMAIN_DMA_FQ;
> +		policy = DMA_API_POLICY_LAZY;
>   	else if (sysfs_streq(buf, "auto"))
> -		req_type = 0;
> +		policy = DMA_API_POLICY_ANY;
>   	else
>   		return -EINVAL;
>   
>   	/*
> -	 * Lock/Unlock the group mutex here before device lock to
> -	 * 1. Make sure that the iommu group has only one device (this is a
> -	 *    prerequisite for step 2)
> -	 * 2. Get struct *dev which is needed to lock device
> -	 */
> -	mutex_lock(&group->mutex);
> -	if (iommu_group_device_count(group) != 1) {
> -		mutex_unlock(&group->mutex);
> -		pr_err_ratelimited("Cannot change default domain: Group has more than one device\n");
> -		return -EINVAL;
> -	}
> -
> -	/* Since group has only one device */
> -	grp_dev = list_first_entry(&group->devices, struct group_device, list);
> -	dev = grp_dev->dev;
> -	get_device(dev);
> -
> -	/*
> -	 * Don't hold the group mutex because taking group mutex first and then
> -	 * the device lock could potentially cause a deadlock as below. Assume
> -	 * two threads T1 and T2. T1 is trying to change default domain of an
> -	 * iommu group and T2 is trying to hot unplug a device or release [1] VF
> -	 * of a PCIe device which is in the same iommu group. T1 takes group
> -	 * mutex and before it could take device lock assume T2 has taken device
> -	 * lock and is yet to take group mutex. Now, both the threads will be
> -	 * waiting for the other thread to release lock. Below, lock order was
> -	 * suggested.
> -	 * device_lock(dev);
> -	 *	mutex_lock(&group->mutex);
> -	 *		iommu_change_dev_def_domain();
> -	 *	mutex_unlock(&group->mutex);
> -	 * device_unlock(dev);
> -	 *
> -	 * [1] Typical device release path
> -	 * device_lock() from device/driver core code
> -	 *  -> bus_notifier()
> -	 *   -> iommu_bus_notifier()
> -	 *    -> iommu_release_device()
> -	 *     -> ops->release_device() vendor driver calls back iommu core code
> -	 *      -> mutex_lock() from iommu core code
> +	 * Taking ownership disables the DMA API domain, prevents drivers from
> +	 * being attached, and switches to a blocking domain. Releasing
> +	 * ownership will put back the new or original DMA API domain.
>   	 */
> -	mutex_unlock(&group->mutex);
> -
> -	/* Check if the device in the group still has a driver bound to it */
> -	device_lock(dev);

With device_lock() removed, this probably races with the
iommu_release_device() path? group->mutex seems insufficient to avoid
the race. Perhaps I missed something.

> -	if (device_is_bound(dev) && !(req_type == IOMMU_DOMAIN_DMA_FQ &&
> -	    group->default_domain->type == IOMMU_DOMAIN_DMA)) {
> -		pr_err_ratelimited("Device is still bound to driver\n");
> -		ret = -EBUSY;
> -		goto out;
> -	}
> -
> -	ret = iommu_change_dev_def_domain(group, dev, req_type);
> -	ret = ret ?: count;
> -
> -out:
> -	device_unlock(dev);
> -	put_device(dev);
> +	ret = iommu_group_claim_dma_owner(group, &ret);
> +	if (ret)
> +		return ret;
>   
> -	return ret;
> +	ret = iommu_change_group_dma_api_policy(group, policy);
> +	iommu_group_release_dma_owner(group);
> +	if (ret)
> +		return ret;
> +	return count;
>   }
>   
>   static bool iommu_is_default_domain(struct iommu_group *group)

Best regards,
baolu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-12-07 13:18                         ` Baolu Lu
@ 2022-12-07 13:23                           ` Jason Gunthorpe
  2022-12-07 14:18                             ` Robin Murphy
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 13:23 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Niklas Schnelle, Robin Murphy, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Wed, Dec 07, 2022 at 09:18:19PM +0800, Baolu Lu wrote:

> > -	/* Check if the device in the group still has a driver bound to it */
> > -	device_lock(dev);
> 
> With device_lock() removed, this probably races with the
> iommu_release_device() path? group->mutex seems insufficient to avoid
> the race. Perhaps I missed something.

This path only deals with group, so there is no 'dev' and no race with
removal.

Later on we obtain the group mutex and then extract the first device
from the group list as a representative device of the group - eg to
perform iommu_domain allocation.

Under the group mutex devices on the device list cannot become
invalid.

It is the same reasoning we use in other places that iterate over the
group device list under lock.
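
Condensed, the pattern described here is roughly the following (the op
itself and do_something_with() are just illustrative placeholders):

/* All walks of the group's device list happen under group->mutex, so a
 * device taken from the list stays valid for as long as the lock is held. */
static int example_group_op(struct iommu_group *group)
{
	struct group_device *grp_dev;
	struct device *dev;
	int ret;

	mutex_lock(&group->mutex);
	grp_dev = list_first_entry(&group->devices, struct group_device, list);
	dev = grp_dev->dev;		/* representative device of the group */
	ret = do_something_with(dev);	/* placeholder */
	mutex_unlock(&group->mutex);

	return ret;
}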

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-12-07 13:23                           ` Jason Gunthorpe
@ 2022-12-07 14:18                             ` Robin Murphy
  2022-12-07 14:30                               ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Robin Murphy @ 2022-12-07 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Baolu Lu
  Cc: Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Wenjia Zhang, Pierre Morel, linux-s390, borntraeger,
	hca, gor, gerald.schaefer, agordeev, svens, linux-kernel,
	Julian Ruess

On 2022-12-07 13:23, Jason Gunthorpe wrote:
> On Wed, Dec 07, 2022 at 09:18:19PM +0800, Baolu Lu wrote:
> 
>>> -	/* Check if the device in the group still has a driver bound to it */
>>> -	device_lock(dev);
>>
>> With device_lock() removed, this probably races with the
>> iommu_release_device() path? group->mutex seems insufficient to avoid
>> the race. Perhaps I missed something.
> 
> This path only deals with group, so there is no 'dev' and no race with
> removal.

If we can now use the ownership mechanism to enforce the required 
constraints for change_dev_def_domain, that would be worthwhile (and a 
lot clearer) as a separate patch in its own right.

Thanks,
Robin.

> Later on we obtain the group mutex and then extract the first device
> from the group list as a representative device of the group - eg to
> perform iommu_domain allocation.
> 
> Under the group mutex devices on the device list cannot become
> invalid.
> 
> It is the same reasoning we use in other places that iterate over the
> group device list under lock.
> 
> Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type
  2022-12-07 14:18                             ` Robin Murphy
@ 2022-12-07 14:30                               ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2022-12-07 14:30 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Baolu Lu, Niklas Schnelle, Matthew Rosato, Gerd Bayer, iommu,
	Joerg Roedel, Will Deacon, Wenjia Zhang, Pierre Morel,
	linux-s390, borntraeger, hca, gor, gerald.schaefer, agordeev,
	svens, linux-kernel, Julian Ruess

On Wed, Dec 07, 2022 at 02:18:36PM +0000, Robin Murphy wrote:
> On 2022-12-07 13:23, Jason Gunthorpe wrote:
> > On Wed, Dec 07, 2022 at 09:18:19PM +0800, Baolu Lu wrote:
> > 
> > > > -	/* Check if the device in the group still has a driver bound to it */
> > > > -	device_lock(dev);
> > > 
> > > With device_lock() removed, this probably races with the
> > > iommu_release_device() path? group->mutex seems insufficient to avoid
> > > the race. Perhaps I missed something.
> > 
> > This path only deals with group, so there is no 'dev' and no race with
> > removal.
> 
> If we can now use the ownership mechanism to enforce the required
> constraints for change_dev_def_domain, that would be worthwhile (and a lot
> clearer) as a separate patch in its own right.

Oh for sure, this is just a brain dump to share

I have a few other patches streamlining things in this file, just need time...

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/7] s390/pci: Use dma-iommu layer
  2022-11-28 18:03   ` Robin Murphy
@ 2022-12-19 15:17     ` Niklas Schnelle
  0 siblings, 0 replies; 39+ messages in thread
From: Niklas Schnelle @ 2022-12-19 15:17 UTC (permalink / raw)
  To: Robin Murphy, Matthew Rosato, Gerd Bayer, iommu, Joerg Roedel,
	Will Deacon, Jason Gunthorpe, Wenjia Zhang
  Cc: Pierre Morel, linux-s390, borntraeger, hca, gor, gerald.schaefer,
	agordeev, svens, linux-kernel, Julian Ruess

On Mon, 2022-11-28 at 18:03 +0000, Robin Murphy wrote:
> On 2022-11-16 17:16, Niklas Schnelle wrote:
> [...]
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index dc5f7a156ff5..804fb8f42d61 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -93,7 +93,7 @@ config IOMMU_DEBUGFS
> >   choice
> >   	prompt "IOMMU default domain type"
> >   	depends on IOMMU_API
> > -	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64
> > +	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64 || S390
> >   	default IOMMU_DEFAULT_DMA_STRICT
> >   	help
> >   	  Choose the type of IOMMU domain used to manage DMA API usage by
> > @@ -412,6 +412,7 @@ config ARM_SMMU_V3_SVA
> >   config S390_IOMMU
> >   	def_bool y if S390 && PCI
> >   	depends on S390 && PCI
> > +	select IOMMU_DMA
> 
> Please add S390 to the condition under config IOMMU_DMA instead.

Done, will be in v3

> 
> >   	select IOMMU_API
> >   	help
> >   	  Support for the IOMMU API for s390 PCI devices.
> [...]
> > @@ -45,9 +329,14 @@ static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
> >   {
> >   	struct s390_domain *s390_domain;
> >   
> > -	if (domain_type != IOMMU_DOMAIN_UNMANAGED)
> > +	switch (domain_type) {
> > +	case IOMMU_DOMAIN_DMA:
> > +	case IOMMU_DOMAIN_DMA_FQ:
> 
> I was about to question adding this without any visible treatment of 
> iommu_iotlb_gather_queued() or domain->type == IOMMU_DOMAIN_DMA_FQ as a 
> whole, but I guess if you never actually free any pagetables it does 
> work out OK. Bit of a timebomb if there's a chance of that ever changing 
> in future, though.

Yes, as we're not freeing pagetables they will be reused simply by
reusing the same IOVA. Restricting the range of IOVAs is then what
keeps the translation table from growing unbounded. This also makes
the pagetable handling easier.

It will definitely need handling if we ever add freeing of pagetables,
yes. I'd hope though that if we do add this functionality it will be
through common code reuse such as io-pgtable.h, which would deal with
this naturally. I'm still pondering whether to add a comment here, but
I also fear that this part is going to be refactored soon and such a
comment would get lost before it could matter.
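
A hedged sketch of the kind of handling that would then be needed
(collect_freed_tables() is a placeholder; the gather API and
put_pages_list() are the existing ones):

/*
 * Only relevant if s390_iommu_unmap_pages() ever starts freeing
 * translation-table pages: tables must outlive the IOTLB flush that
 * covers them, so they are collected on gather->freelist instead of
 * being freed immediately.
 */
static void s390_iommu_unmap_collect(struct s390_domain *s390_domain,
				     unsigned long iova, size_t size,
				     struct iommu_iotlb_gather *gather)
{
	/* placeholder for detaching now-empty table pages */
	collect_freed_tables(s390_domain, iova, size, &gather->freelist);
}

static void s390_iommu_iotlb_sync(struct iommu_domain *domain,
				  struct iommu_iotlb_gather *gather)
{
	/* ... existing RPCIT flush of gather->start .. gather->end ... */

	/*
	 * In the strict path the flush has now completed, so the pages
	 * can go; in the lazy path (iommu_iotlb_gather_queued()) the
	 * dma-iommu flush queue takes over the freelist and only frees
	 * it after the deferred flush has run.
	 */
	put_pages_list(&gather->freelist);
}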

> 
> > +	case IOMMU_DOMAIN_UNMANAGED:
> > +		break;
> > +	default:
> >   		return NULL;
> > -
> > +	}
> >   	s390_domain = kzalloc(sizeof(*s390_domain), GFP_KERNEL);
> >   	if (!s390_domain)
> >   		return NULL;
> [...]
> > @@ -397,12 +678,29 @@ static size_t s390_iommu_unmap_pages(struct iommu_domain *domain,
> >   		return 0;
> >   
> >   	iommu_iotlb_gather_add_range(gather, iova, size);
> > +	atomic64_add(pgcount, &s390_domain->ctrs.unmapped_pages);
> >   
> >   	return size;
> >   }
> >   
> > +static void s390_iommu_probe_finalize(struct device *dev)
> > +{
> > +	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > +
> > +	iommu_dma_forcedac = true;
> > +	iommu_setup_dma_ops(dev, domain->geometry.aperture_start, domain->geometry.aperture_end);
> 
> For consistency with all but one other caller now, just pass 0 and 
> U64_MAX for base and size to make it clear that they're meaningless 
> (they will eventually go away as part of a bigger refactoring).
> 
> Thanks,
> Robin.

Done, will be in v3.

Now I'm still trying to figure out how to prefer the proposed
IOMMU_DOMAIN_DMA_SQ when running under a hypervisor with expensive
IOTLB flushes, while IOMMU_DOMAIN_DMA and IOMMU_DOMAIN_DMA_FQ both
work but are slower. Jason's patch idea of adding a DMA_API_POLICY
sounds like a good start, but it is also a rather big change and
doesn't yet allow for a preference by the IOMMU driver. I'll see if I
can come up with something minimal for that.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2022-12-19 15:18 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-16 17:16 [PATCH v2 0/7] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 1/7] s390/ism: Set DMA coherent mask Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 2/7] s390/pci: prepare is_passed_through() for dma-iommu Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 3/7] s390/pci: Use dma-iommu layer Niklas Schnelle
2022-11-28 18:03   ` Robin Murphy
2022-12-19 15:17     ` Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 4/7] iommu: Let iommu.strict override ops->def_domain_type Niklas Schnelle
2022-11-17  1:55   ` Baolu Lu
2022-11-28 11:10     ` Niklas Schnelle
2022-11-28 13:00       ` Baolu Lu
2022-11-28 13:29       ` Jason Gunthorpe
2022-11-28 15:54         ` Niklas Schnelle
2022-11-28 16:35           ` Jason Gunthorpe
2022-11-28 21:01             ` Robin Murphy
2022-11-29 17:33               ` Jason Gunthorpe
2022-11-29 18:41                 ` Robin Murphy
2022-11-29 20:09                   ` Jason Gunthorpe
2022-11-30  1:28                     ` Baolu Lu
2022-12-05 15:34                     ` Niklas Schnelle
2022-12-06 23:09                       ` Jason Gunthorpe
2022-12-07 13:18                         ` Baolu Lu
2022-12-07 13:23                           ` Jason Gunthorpe
2022-12-07 14:18                             ` Robin Murphy
2022-12-07 14:30                               ` Jason Gunthorpe
2022-11-28 16:56           ` Robin Murphy
2022-11-16 17:16 ` [PATCH v2 5/7] iommu/dma: Allow a single FQ in addition to per-CPU FQs Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 6/7] iommu/dma: Enable variable queue size and use larger single queue Niklas Schnelle
2022-11-16 17:16 ` [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication Niklas Schnelle
2022-11-28 14:52   ` Robin Murphy
2022-11-29 12:00     ` Niklas Schnelle
2022-11-29 12:53       ` Robin Murphy
2022-11-29 14:40         ` Niklas Schnelle
2022-12-02 14:29           ` Niklas Schnelle
2022-12-02 14:42             ` Jason Gunthorpe
2022-12-02 15:12               ` Niklas Schnelle
2022-12-02 15:24                 ` Jason Gunthorpe
2022-12-05 18:24             ` Robin Murphy
2022-12-06 10:13               ` Niklas Schnelle
2022-11-29 13:51       ` Matthew Rosato
