* [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows
@ 2014-09-23  3:00 Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size Alexey Kardashevskiy
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm


This enables the PAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (PE, an IOMMU group) has a separate DMA window
on a PCI bus where devices are allowed to perform DMA. By default a 1GB or
2GB window is allocated at host boot time, and these windows are used when
an IOMMU group is passed to userspace (the guest). These windows are mapped
at zero offset on a PCI bus.

High-speed devices may suffer from the limited size of this window. On
the host side, a TCE bypass mode is enabled on the POWER8 CPU which
implements direct mapping of host memory to the PCI bus at 1<<59.

For the guest, PAPR defines a DDW RTAS API which allows a pseries guest
to query the hypervisor whether it supports DDW and what the parameters
of the possible windows are.

Currently POWER8 supports 2 DMA windows per PE: the already mentioned
small 32-bit window, and a 64-bit window which can only start at 1<<59
and supports various page sizes.

This patchset reworks the PPC IOMMU code and adds the necessary
structures to extend it to support big windows.

When the guest detects the feature and the PE is capable of 64-bit DMA,
it:
1. queries the hypervisor for the number of available windows and
supported page masks;
2. creates a window with the biggest possible page size (current guests
can do 64K or 16MB TCEs);
3. maps the entire guest RAM via H_PUT_TCE* hypercalls;
4. switches dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore and the guest gets
maximum performance.
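
In pseudo-code, the guest-side sequence might look like this (the RTAS
call names come from PAPR; the ddw_* helpers and the win/liobn fields
are illustrative, not actual guest code):

	/* Sketch only: guest-side DDW setup for one PE */
	ddw_query(pdn, &windows_avail, &page_size_mask); /* ibm,query-pe-dma-window */
	win = ddw_create(pdn, biggest_shift(page_size_mask),
			 guest_ram_size);		 /* ibm,create-pe-dma-window */
	for (addr = 0; addr < guest_ram_size; addr += 1UL << win->page_shift)
		h_put_tce(win->liobn, addr,
			  addr | TCE_PCI_READ | TCE_PCI_WRITE); /* H_PUT_TCE */
	set_dma_ops(dev, &dma_direct_ops);	/* no more hypercalls needed */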

Please comment. Thanks!


Changes:
v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise :)



Alexey Kardashevskiy (13):
  powerpc/iommu: Check that TCE page size is equal to it_page_size
  powerpc/powernv: Make invalidate() a callback
  powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  powerpc/powernv: Convert/move set_bypass() callback to
    take_ownership()
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  powerpc/powernv: Release replaced TCE
  powerpc/pseries/lpar: Enable VFIO
  powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Enable Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h            |  35 ++-
 arch/powerpc/include/asm/machdep.h          |  25 --
 arch/powerpc/include/asm/tce.h              |  37 +++
 arch/powerpc/kernel/iommu.c                 | 213 +++++++++------
 arch/powerpc/kernel/vio.c                   |   5 +-
 arch/powerpc/platforms/cell/iommu.c         |   9 +-
 arch/powerpc/platforms/pasemi/iommu.c       |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 233 +++++++++++++++--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   4 +-
 arch/powerpc/platforms/powernv/pci.c        | 113 +++++---
 arch/powerpc/platforms/powernv/pci.h        |  15 +-
 arch/powerpc/platforms/pseries/iommu.c      |  77 ++++--
 arch/powerpc/sysdev/dart_iommu.c            |  13 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 384 +++++++++++++++++++++++-----
 include/uapi/linux/vfio.h                   |  25 +-
 15 files changed, 925 insertions(+), 271 deletions(-)

-- 
2.0.0



* [PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This checks that the TCE table page size is not bigger than the size of
the page we have just pinned and whose physical address we are about to
put into the table.

Otherwise the hardware gets unwanted access to the physical memory
between the end of the actual page and the end of the aligned-up TCE
page.
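
For example, a 64K huge page (shift 16) cannot safely back a 16MB TCE
(shift 24): the device could DMA into the 16MB - 64K of memory following
the pinned page. A standalone sketch of the comparison being added
(illustrative only, not part of the patch):

	/* Is the pinned page big enough to cover one whole IOMMU page? */
	static bool tce_page_is_contained(unsigned long page_shift,
					  unsigned long it_page_shift)
	{
		/* page_shift = 16 (64K), it_page_shift = 24 (16MB) => false */
		return page_shift >= it_page_shift;
	}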

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/hugetlb.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 				tce, entry << tbl->it_page_shift, ret); */
 		return -EFAULT;
 	}
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk that it should.
+	 */
+	if (PageHuge(page)) {
+		struct page *head = compound_head(page);
+		long shift = PAGE_SHIFT + compound_order(head);
+
+		if (shift < tbl->it_page_shift) {
+			ret = -EINVAL;
+			goto put_page_exit;
+		}
+
+	}
+
 	hwaddr = (unsigned long) page_address(page) + offset;
 
 	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
 	if (ret)
-		put_page(page);
+		goto put_page_exit;
 
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+	return 0;
+
+put_page_exit:
+	pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
 			__func__, entry << tbl->it_page_shift, tce, ret);
 
+	put_page(page);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-- 
2.0.0



* [PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via
container_of(tbl). Since we are going to add Dynamic DMA windows, which
means having 2 IOMMU tables per PE, this is not going to work.

This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback.

This adds a pnv_iommu_table wrapper around iommu_table and stores
a pointer to the PE there. PNV's ppc_md.tce_build() call uses this to find
the PE and do the invalidation. This will be used later for Dynamic DMA
windows too.

This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.
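
Why container_of() stops working: with a single table embedded in the PE,
container_of(tbl, struct pnv_ioda_pe, tce32_table) is unambiguous, but
with two embedded tables the caller cannot tell from the tbl pointer
alone which member offset to use. Wrapping every table makes the
back-pointer explicit; an illustrative helper (not part of the patch):

	static inline struct pnv_iommu_table *to_pnv(struct iommu_table *tbl)
	{
		/* valid for any iommu_table embedded in a pnv_iommu_table */
		return container_of(tbl, struct pnv_iommu_table, table);
	}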

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* changed commit log to explain why this change is needed
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ++++++++++++-------------------
 arch/powerpc/platforms/powernv/pci.c      | 31 ++++++++++++++++++++-------
 arch/powerpc/platforms/powernv/pci.h      | 13 +++++++++++-
 3 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index df241b1..136e765 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->tce32.table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->tce32.table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
-{
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate;
 
 	/* OPAL variant of P7IOC SW invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
 	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1193,9 +1182,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce32.invalidate_fn = pnv_pci_ioda2_tce_invalidate;
 
 	/* OPAL variant of PHB3 invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b854b57..97895d4 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -599,6 +599,27 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };
 
+static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
+	__be64 *endp, bool rm)
+{
+	struct pnv_iommu_table *ptbl = container_of(tbl,
+			struct pnv_iommu_table, table);
+	struct pnv_ioda_pe *pe = ptbl->pe;
+
+	/*
+	 * Some implementations won't cache invalid TCEs and thus may not
+	 * need that flush. We'll probably turn it_type into a bit mask
+	 * of flags if that becomes the case
+	 */
+	if (!(tbl->it_type & TCE_PCI_SWINV_FREE))
+		return;
+
+	if (!pe || !ptbl->invalidate_fn)
+		return;
+
+	ptbl->invalidate_fn(pe, tbl, startp, endp, rm);
+}
+
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
@@ -619,12 +640,7 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		*(tcep++) = cpu_to_be64(proto_tce |
 				(rpn++ << tbl->it_page_shift));
 
-	/* Some implementations won't cache invalid TCEs and thus may not
-	 * need that flush. We'll probably turn it_type into a bit mask
-	 * of flags if that becomes the case
-	 */
-	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 
 	return 0;
 }
@@ -648,8 +664,7 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
 	while (npages--)
 		*(tcep++) = cpu_to_be64(0);
 
-	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 }
 
 static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 48494d4..095db43 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -24,6 +24,17 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
 
+struct pnv_ioda_pe;
+typedef void (*pnv_invalidate_fn)(struct pnv_ioda_pe *pe,
+		struct iommu_table *tbl,
+		__be64 *startp, __be64 *endp, bool rm);
+
+struct pnv_iommu_table {
+	struct iommu_table	table;
+	struct pnv_ioda_pe	*pe;
+	pnv_invalidate_fn	invalidate_fn;
+};
+
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
 struct pnv_ioda_pe {
@@ -53,7 +64,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
2.0.0



* [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23 20:42   ` Alex Williamson
  2014-09-23  3:00 ` [PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

Modern IBM POWERPC systems support multiple IOMMU tables per PE, so we
need a more reliable way (compared to container_of()) to get a PE pointer
from the iommu_table struct pointer used in IOMMU functions.

At the moment IOMMU group data points to an iommu_table struct. This
introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
the pnv_ioda_pe struct, for others it is still a pointer to
the iommu_table struct. The ops structs correspond to the type which
iommu_owner points to.

This defines a get_table() callback which returns an iommu_table
by its number.

As the IOMMU group data pointer now points to a variable type instead of
iommu_table, the VFIO SPAPR TCE driver is updated to use the new type.
This changes the tce_container struct to store an iommu_group instead of
an iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via
iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
	- iommu_table;
	- pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2, but IODA2
will get its own pnv_ioda2_iommu_get_table soon and then
pnv_ioda1_iommu_get_table will only be used for IODA1.
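
In code, walking the new chain from a group down to a table looks
roughly like this (error handling omitted; a sketch, not a quote from
the patch):

	struct spapr_tce_iommu_group *data =
			iommu_group_get_iommudata(grp);
	struct iommu_table *tbl = data->ops->get_table(data, 0);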

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   6 ++
 arch/powerpc/include/asm/tce.h              |  13 +++
 arch/powerpc/kernel/iommu.c                 |  35 ++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +++++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/pseries/iommu.c      |  10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 148 ++++++++++++++++++++++------
 8 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+				 void *iommu_owner,
+				 struct spapr_tce_iommu_ops *ops,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+					void *iommu_owner,
+					struct spapr_tce_iommu_ops *ops,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..9f159eb 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,18 @@
 #define TCE_PCI_READ		0x1		/* read from PCI allowed */
 #define TCE_VB_WRITE		0x1		/* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+struct spapr_tce_iommu_ops {
+	struct iommu_table *(*get_table)(
+			struct spapr_tce_iommu_group *data,
+			int num);
+};
+
+struct spapr_tce_iommu_group {
+	void *iommu_owner;
+	struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b378f78..1c5dae7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_default_get_table(
+		struct spapr_tce_iommu_group *data, int num)
+{
+	struct iommu_table *tbl = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (tbl->it_size)
+			return tbl;
+		/* fallthru */
+	default:
+		return NULL;
+	}
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+	.get_table = spapr_tce_default_get_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
 		int pci_domain_number, unsigned long pe_num)
 {
 	struct iommu_group *grp;
 	char *name;
+	struct spapr_tce_iommu_group *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return;
+
+	data->iommu_owner = iommu_owner ? iommu_owner : tbl;
+	data->ops = ops ? ops : &spapr_tce_default_ops;
 
 	grp = iommu_group_alloc();
 	if (IS_ERR(grp)) {
 		pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
 				PTR_ERR(grp));
+		kfree(data);
 		return;
 	}
+
 	tbl->it_group = grp;
-	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu_group_set_iommudata(grp, data, group_release);
 	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			pci_domain_number, pe_num);
 	if (!name)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 136e765..2d32a1c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/msi.h>
 #include <linux/memblock.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -988,6 +989,26 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
+static struct iommu_table *pnv_ioda1_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		int num)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (pe->tce32.table.it_size)
+			return &pe->tce32.table;
+		/* fallthru */
+	default:
+		return NULL;
+	}
+}
+
+static struct spapr_tce_iommu_ops pnv_pci_ioda1_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -1067,7 +1088,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -1137,6 +1159,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
+static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -1202,7 +1228,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 94ce348..b79066d 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -89,6 +89,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
 		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
+				NULL, NULL,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 97895d4..6ffac79 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -723,7 +723,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	iommu_init_table(tbl, hose->node);
-	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 4642d6a..b95f8cf 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -616,7 +616,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 	iommu_table_setparms(pci->phb, dn, tbl);
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(tbl, pci_domain_nr(bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -661,7 +661,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(bus), 0);
+		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -688,7 +688,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
-		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
 					       PCI_DN(dn)->iommu_table);
 		return;
@@ -1105,7 +1106,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 730b4ef..a8adfbd 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -43,17 +43,51 @@ static void tce_iommu_detach_group(void *iommu_data,
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_table *tbl;
+	struct iommu_group *grp;
+	long windows_num;
 	bool enabled;
 };
 
+static struct iommu_table *spapr_tce_find_table(
+		struct tce_container *container,
+		struct spapr_tce_iommu_group *data,
+		phys_addr_t ioba)
+{
+	long i;
+	struct iommu_table *ret = NULL;
+
+	mutex_lock(&container->lock);
+	for (i = 0; i < container->windows_num; ++i) {
+		struct iommu_table *tbl = data->ops->get_table(data, i);
+
+		if (tbl) {
+			unsigned long entry = ioba >> tbl->it_page_shift;
+			unsigned long start = tbl->it_offset;
+			unsigned long end = start + tbl->it_size;
+
+			if ((start <= entry) && (entry < end)) {
+				ret = tbl;
+				break;
+			}
+		}
+	}
+	mutex_unlock(&container->lock);
+
+	return ret;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
 	unsigned long locked, lock_limit, npages;
-	struct iommu_table *tbl = container->tbl;
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	if (!container->tbl)
+	if (!container->grp)
+		return -ENXIO;
+
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return -ENXIO;
 
 	if (!current->mm)
@@ -80,6 +114,10 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
 	 */
+	tbl = data->ops->get_table(data, 0);
+	if (!tbl)
+		return -ENXIO;
+
 	down_write(&current->mm->mmap_sem);
 	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
@@ -89,7 +127,6 @@ static int tce_iommu_enable(struct tce_container *container)
 				rlimit(RLIMIT_MEMLOCK));
 		ret = -ENOMEM;
 	} else {
-
 		current->mm->locked_vm += npages;
 		container->enabled = true;
 	}
@@ -100,16 +137,27 @@ static int tce_iommu_enable(struct tce_container *container)
 
 static void tce_iommu_disable(struct tce_container *container)
 {
+	struct spapr_tce_iommu_group *data;
+	struct iommu_table *tbl;
+
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->tbl || !current->mm)
+	if (!container->grp || !current->mm)
+		return;
+
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
+		return;
+
+	tbl = data->ops->get_table(data, 0);
+	if (!tbl)
 		return;
 
 	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (container->tbl->it_size <<
+	current->mm->locked_vm -= (tbl->it_size <<
 			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
@@ -129,6 +177,8 @@ static void *tce_iommu_open(unsigned long arg)
 
 	mutex_init(&container->lock);
 
+	container->windows_num = 1;
+
 	return container;
 }
 
@@ -136,11 +186,11 @@ static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
 
-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (container->grp)
+		tce_iommu_detach_group(iommu_data, container->grp);
 
 	mutex_destroy(&container->lock);
 
@@ -169,8 +219,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, 0);
 		if (WARN_ON(!tbl))
 			return -ENXIO;
 
@@ -194,13 +253,16 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 		unsigned long tce, i;
 
-		if (!tbl)
+		if (WARN_ON(!container->grp))
 			return -ENXIO;
 
-		BUG_ON(!tbl->it_group);
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -225,6 +287,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
 			tce |= TCE_PCI_WRITE;
 
+		tbl = spapr_tce_find_table(container, data, param.iova);
+		if (!tbl)
+			return -ENXIO;
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_put_param_check(tbl, param.iova, tce);
 		if (ret)
 			return ret;
@@ -247,9 +314,14 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_UNMAP_DMA: {
 		struct vfio_iommu_type1_dma_unmap param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
@@ -268,6 +340,12 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.size & ~IOMMU_PAGE_MASK_4K)
 			return -EINVAL;
 
+		tbl = spapr_tce_find_table(container, data, param.iova);
+		if (!tbl)
+			return -ENXIO;
+
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
 				param.size >> IOMMU_PAGE_SHIFT_4K);
 		if (ret)
@@ -293,10 +371,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		mutex_unlock(&container->lock);
 		return 0;
 	case VFIO_EEH_PE_OP:
-		if (!container->tbl || !container->tbl->it_group)
+		if (!container->grp)
 			return -ENODEV;
 
-		return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
+		return vfio_spapr_iommu_eeh_ioctl(container->grp,
 						  cmd, arg);
 	}
 
@@ -308,16 +386,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 {
 	int ret;
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->tbl) {
+	if (container->grp) {
 		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(container->grp),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else if (container->enabled) {
@@ -325,9 +403,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else {
+		data = iommu_group_get_iommudata(iommu_group);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, 0);
+		BUG_ON(!tbl);
+
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
-			container->tbl = tbl;
+			container->grp = iommu_group;
 	}
 
 	mutex_unlock(&container->lock);
@@ -339,24 +424,31 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
-	if (tbl != container->tbl) {
+	if (iommu_group != container->grp) {
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
-				iommu_group_id(tbl->it_group));
+				iommu_group_id(container->grp));
 	} else {
 		if (container->enabled) {
 			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
+					iommu_group_id(container->grp));
 			tce_iommu_disable(container);
 		}
 
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
+		container->grp = NULL;
+
+		data = iommu_group_get_iommudata(iommu_group);
+		BUG_ON(!data || !data->iommu_owner || !data->ops);
+
+		tbl = data->ops->get_table(data, 0);
+		BUG_ON(!tbl);
+
 		iommu_release_ownership(tbl);
 	}
 	mutex_unlock(&container->lock);
-- 
2.0.0



* [PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23 21:00   ` Alex Williamson
  2014-09-23  3:00 ` [PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on the IODA2 PHB. This is exposed to POWERPC
IOMMU code which calls this callback when external IOMMU users such as
VFIO are about to take control over a PHB.

Since set_bypass() is not really an iommu_table function but a PE
function, and we have an ops struct per IOMMU owner, let's move
set_bypass() to the spapr_tce_iommu_ops struct.

As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
has very little to do with PEs, this moves take_ownership() calls to
the VFIO SPAPR TCE driver.

This renames set_bypass() to take_ownership() as it does not necessarily
just enable bypassing; it can be something else or more, so let's give it
a generic name. The bool parameter is inverted.
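
The inversion means the old and new calls relate as follows (a sketch
summarising the IODA2 hunk below):

	/* take_ownership(data, true)  -> set_bypass(pe, false): VFIO owns PE   */
	/* take_ownership(data, false) -> set_bypass(pe, true):  kernel owns PE */
	static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
					     bool enable)
	{
		struct pnv_ioda_pe *pe = data->iommu_owner;

		pnv_pci_ioda2_set_bypass(pe, !enable);
	}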

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |  1 -
 arch/powerpc/include/asm/tce.h            |  2 ++
 arch/powerpc/kernel/iommu.c               | 12 ------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 20 ++++++++++++--------
 drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
 5 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 84ee339..2b0b01d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,7 +77,6 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
-	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 9f159eb..e6355f9 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops {
 	struct iommu_table *(*get_table)(
 			struct spapr_tce_iommu_group *data,
 			int num);
+	void (*take_ownership)(struct spapr_tce_iommu_group *data,
+			bool enable);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1c5dae7..c2c8d9d 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
-	/*
-	 * Disable iommu bypass, otherwise the user can DMA to all of
-	 * our physical memory via the bypass window instead of just
-	 * the pages that has been explicitly mapped into the iommu
-	 */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, false);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
-
-	/* The kernel owns the device now, we can restore the iommu bypass */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2d32a1c..8cb2f31 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 		 * host side.
 		 */
 		if (pe->pdev)
-			set_iommu_table_base(&pe->pdev->dev, tbl);
+			set_iommu_table_base(&pe->pdev->dev, &pe->tce32.table);
 		else
 			pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
 	}
@@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	/* Install set_bypass callback for VFIO */
-	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
-
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
+	pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
+				     bool enable)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
 	.get_table = pnv_ioda1_iommu_get_table,
+	.take_ownership = pnv_ioda2_take_ownership,
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a8adfbd..1c1a9c4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -76,6 +76,13 @@ static struct iommu_table *spapr_tce_find_table(
 	return ret;
 }
 
+static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
+		bool enable)
+{
+	if (data && data->ops && data->ops->take_ownership)
+		data->ops->take_ownership(data, enable);
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -413,6 +420,12 @@ static int tce_iommu_attach_group(void *iommu_data,
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
 			container->grp = iommu_group;
+		/*
+		 * Disable iommu bypass, otherwise the user can DMA to all of
+		 * our physical memory via the bypass window instead of just
+		 * the pages that has been explicitly mapped into the iommu
+		 */
+		tce_iommu_take_ownership_notify(data, true);
 	}
 
 	mutex_unlock(&container->lock);
@@ -450,6 +463,9 @@ static void tce_iommu_detach_group(void *iommu_data,
 		BUG_ON(!tbl);
 
 		iommu_release_ownership(tbl);
+
+		/* Kernel owns the device now, we can restore bypass */
+		tce_iommu_take_ownership_notify(data, false);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0



* [PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears the TCE content if there is no page marked busy in
it_map. Clearing must be done outside of the table locks as
iommu_clear_tce(), called from iommu_clear_tces_and_put_pages(), takes
these locks itself.
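
Both functions now take the pool locks in the same order (a sketch of
the pattern used in the hunks below):

	spin_lock_irqsave(&tbl->large_pool.lock, flags);
	for (i = 0; i < tbl->nr_pools; i++)
		spin_lock(&tbl->pools[i].lock);

	/* ... inspect/update it_map atomically vs. allocators ... */

	for (i = 0; i < tbl->nr_pools; i++)
		spin_unlock(&tbl->pools[i].lock);
	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);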

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c2c8d9d..cd80867 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0, bit0 = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+		bit0 = test_and_clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (bit0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-	return 0;
+	if (!ret)
+		iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+				tbl->it_size);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
+
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.0.0



* [PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 07/13] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong.

This adds an extra @ops parameter to iommu_init_table() to make sure
that we do not leave any IOMMU table without iommu_table_ops. @it_ops is
initialized at the very beginning because iommu_init_table() calls
iommu_table_clear() and the latter already uses the callbacks.

This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" prefixes
for better readability.

This removes the tce_xxx_rm handlers from ppc_md as well but does not add
them to iommu_table_ops; this will be done later if we decide to support
TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as
callbacks for pseries. This changes the "multi" callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want to
walk through all IOMMU tables in the system and replace the "multi"
callbacks with single ones.
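
A platform now wires up its callbacks like this (a sketch with
illustrative my_* names, mirroring the cell/pasemi hunks below):

	static struct iommu_table_ops my_platform_ops = {
		.set   = my_tce_build,	/* was ppc_md.tce_build */
		.clear = my_tce_free,	/* was ppc_md.tce_free  */
	};

	...
	iommu_init_table(tbl, nid, &my_platform_ops);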

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 20 +++++++++++-
 arch/powerpc/include/asm/machdep.h          | 25 ---------------
 arch/powerpc/kernel/iommu.c                 | 50 ++++++++++++++++-------------
 arch/powerpc/kernel/vio.c                   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c         |  9 ++++--
 arch/powerpc/platforms/pasemi/iommu.c       |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c        | 24 ++++----------
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 42 +++++++++++++-----------
 arch/powerpc/sysdev/dart_iommu.c            | 13 ++++----
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-					    int nid);
+					    int nid,
+					    struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index b125cea..1fc824d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				    long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free_rm)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	void		(*tce_flush_rm)(struct iommu_table *tbl);
-
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index cd80867..678fee8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -323,11 +323,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	build_fail = ppc_md.tce_build(tbl, entry, npages,
+	build_fail = tbl->it_ops->set(tbl, entry, npages,
 				      (unsigned long)page &
 				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
-	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	/* tbl->it_ops->set() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
 	 * DMA_ERROR_CODE. For all other errors the functionality is
 	 * not altered.
@@ -338,8 +338,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -409,7 +409,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	if (!iommu_free_check(tbl, dma_addr, npages))
 		return;
 
-	ppc_md.tce_free(tbl, entry, npages);
+	tbl->it_ops->clear(tbl, entry, npages);
 
 	spin_lock_irqsave(&(pool->lock), flags);
 	bitmap_clear(tbl->it_map, free_entry, npages);
@@ -425,8 +425,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	 * not do an mb() here on purpose, it is not needed on any of
 	 * the current platforms.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -496,7 +496,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		build_fail = ppc_md.tce_build(tbl, entry, npages,
+		build_fail = tbl->it_ops->set(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
 					      direction, attrs);
 		if(unlikely(build_fail))
@@ -535,8 +535,8 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	DBG("mapped %d elements:\n", outcount);
 
@@ -601,8 +601,8 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 	 * do not do an mb() here, the affected platforms do not need it
 	 * when freeing.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 static void iommu_table_clear(struct iommu_table *tbl)
@@ -614,17 +614,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
 	 */
 	if (!is_kdump_kernel() || is_fadump_active()) {
 		/* Clear the table in case firmware left allocations in it */
-		ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+		tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
 		return;
 	}
 
 #ifdef CONFIG_CRASH_DUMP
-	if (ppc_md.tce_get) {
+	if (tbl->it_ops->get) {
 		unsigned long index, tceval, tcecount = 0;
 
 		/* Reserve the existing mappings left by the first kernel. */
 		for (index = 0; index < tbl->it_size; index++) {
-			tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+			tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
 			/*
 			 * Freed TCE entry contains 0x7fffffffffffffff on JS20
 			 */
@@ -650,7 +650,8 @@ static void iommu_table_clear(struct iommu_table *tbl)
  * Build a iommu_table structure.  This contains a bit map which
  * is used to manage allocation of the tce space.
  */
-struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
+struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
+		struct iommu_table_ops *ops)
 {
 	unsigned long sz;
 	static int welcomed = 0;
@@ -658,6 +659,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	unsigned int i;
 	struct iommu_pool *p;
 
+	BUG_ON(!ops);
+	tbl->it_ops = ops;
+
 	/* number of bytes needed for the bitmap */
 	sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
@@ -949,8 +953,8 @@ EXPORT_SYMBOL_GPL(iommu_tce_direction);
 void iommu_flush_tce(struct iommu_table *tbl)
 {
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -961,7 +965,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce_value,
 		unsigned long npages)
 {
-	/* ppc_md.tce_free() does not support any value but 0 */
+	/* tbl->it_ops->clear() does not support any value but 0 */
 	if (tce_value)
 		return -EINVAL;
 
@@ -1009,9 +1013,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		ppc_md.tce_free(tbl, entry, 1);
+		tbl->it_ops->clear(tbl, entry, 1);
 	else
 		oldtce = 0;
 
@@ -1058,10 +1062,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	/* Add new entry if it is not busy */
 	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
 
 	spin_unlock(&(pool->lock));
 
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 5bfdab9..c61ce7a 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1196,7 +1196,10 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
 
-	return iommu_init_table(tbl, -1);
+	return iommu_init_table(tbl, -1,
+			firmware_has_feature(FW_FEATURE_LPAR) ?
+			&iommu_table_lpar_multi_ops :
+			&iommu_table_pseries_ops);
 }
 
 /**
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 2b90ff8..f6254f7 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -465,6 +465,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
 	return *ioid;
 }
 
+static struct iommu_table_ops cell_iommu_ops = {
+	.set = tce_build_cell,
+	.clear = tce_free_cell
+};
+
 static struct iommu_window * __init
 cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 			unsigned long offset, unsigned long size,
@@ -492,7 +497,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 		(offset >> window->table.it_page_shift) + pte_offset;
 	window->table.it_size = size >> window->table.it_page_shift;
 
-	iommu_init_table(&window->table, iommu->nid);
+	iommu_init_table(&window->table, iommu->nid, &cell_iommu_ops);
 
 	pr_debug("\tioid      %d\n", window->ioid);
 	pr_debug("\tblocksize %ld\n", window->table.it_blocksize);
@@ -1199,8 +1204,6 @@ static int __init cell_iommu_init(void)
 	/* Setup various ppc_md. callbacks */
 	ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
 	ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
-	ppc_md.tce_build = tce_build_cell;
-	ppc_md.tce_free = tce_free_cell;
 
 	if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
 		goto bail;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 2e576f2..eac33f4 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -132,6 +132,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
 	}
 }
 
+static struct iommu_table_ops iommu_table_iobmap_ops = {
+	.set = iobmap_build,
+	.clear  = iobmap_free
+};
 
 static void iommu_table_iobmap_setup(void)
 {
@@ -151,7 +155,7 @@ static void iommu_table_iobmap_setup(void)
 	 * Should probably be 8 (64 bytes)
 	 */
 	iommu_table_iobmap.it_blocksize = 4;
-	iommu_init_table(&iommu_table_iobmap, 0);
+	iommu_init_table(&iommu_table_iobmap, 0, &iommu_table_iobmap_ops);
 	pr_debug(" <- %s\n", __func__);
 }
 
@@ -250,8 +254,6 @@ void __init iommu_init_early_pasemi(void)
 
 	ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pasemi;
 	ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pasemi;
-	ppc_md.tce_build = iobmap_build;
-	ppc_md.tce_free  = iobmap_free;
 	set_pci_dma_ops(&dma_iommu_ops);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8cb2f31..296f49b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1087,7 +1087,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_FREE   |
 				 TCE_PCI_SWINV_PAIR);
 	}
-	iommu_init_table(tbl, phb->hose->node);
+	iommu_init_table(tbl, phb->hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
 			phb->hose->global_number, pe->pe_number);
 
@@ -1231,7 +1231,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
-	iommu_init_table(tbl, phb->hose->node);
+	iommu_init_table(tbl, phb->hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
 			phb->hose->global_number, pe->pe_number);
 
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index b79066d..ea97414 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -87,7 +87,8 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
-		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
+		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node,
+				&pnv_iommu_ops);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
 				NULL, NULL,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 6ffac79..deddcad 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -677,18 +677,11 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
 }
 
-static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
-			    enum dma_data_direction direction,
-			    struct dma_attrs *attrs)
-{
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
-}
-
-static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
-{
-	pnv_tce_free(tbl, index, npages, true);
-}
+struct iommu_table_ops pnv_iommu_ops = {
+	.set = pnv_tce_build_vm,
+	.clear = pnv_tce_free_vm,
+	.get = pnv_tce_get,
+};
 
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
@@ -722,7 +715,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
-	iommu_init_table(tbl, hose->node);
+	iommu_init_table(tbl, hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
@@ -865,11 +858,6 @@ void __init pnv_pci_init(void)
 
 	/* Configure IOMMU DMA hooks */
 	ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
-	ppc_md.tce_build = pnv_tce_build_vm;
-	ppc_md.tce_free = pnv_tce_free_vm;
-	ppc_md.tce_build_rm = pnv_tce_build_rm;
-	ppc_md.tce_free_rm = pnv_tce_free_rm;
-	ppc_md.tce_get = pnv_tce_get;
 	ppc_md.pci_probe_mode = pnv_pci_probe_mode;
 	set_pci_dma_ops(&dma_iommu_ops);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 095db43..cf68c4b 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -223,6 +223,7 @@ extern struct pci_ops pnv_pci_ops;
 #ifdef CONFIG_EEH
 extern struct pnv_eeh_ops ioda_eeh_ops;
 #endif
+extern struct iommu_table_ops pnv_iommu_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b95f8cf..9a7364f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -193,7 +193,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	int ret = 0;
 	unsigned long flags;
 
-	if (npages == 1) {
+	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 		                           direction, attrs);
 	}
@@ -285,6 +285,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
 {
 	u64 rc;
 
+	if (!firmware_has_feature(FW_FEATURE_MULTITCE))
+		return tce_free_pSeriesLP(tbl, tcenum, npages);
+
 	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
 
 	if (rc && printk_ratelimit()) {
@@ -460,7 +463,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
 	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
-
 #ifdef CONFIG_PCI
 static void iommu_table_setparms(struct pci_controller *phb,
 				 struct device_node *dn,
@@ -546,6 +548,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	tbl->it_size = size >> tbl->it_page_shift;
 }
 
+struct iommu_table_ops iommu_table_pseries_ops = {
+	.set = tce_build_pSeries,
+	.clear = tce_free_pSeries,
+	.get = tce_get_pseries
+};
+
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 {
 	struct device_node *dn;
@@ -615,7 +623,8 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 			   pci->phb->node);
 
 	iommu_table_setparms(pci->phb, dn, tbl);
-	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+	pci->iommu_table = iommu_init_table(tbl, pci->phb->node,
+			&iommu_table_pseries_ops);
 	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
@@ -626,6 +635,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+	.set = tce_buildmulti_pSeriesLP,
+	.clear = tce_freemulti_pSeriesLP,
+	.get = tce_get_pSeriesLP
+};
 
 static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 {
@@ -660,7 +674,8 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
-		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
+		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node,
+				&iommu_table_lpar_multi_ops);
 		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
@@ -687,7 +702,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
-		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
+		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node,
+				&iommu_table_pseries_ops);
 		iommu_register_group(tbl, NULL, NULL,
 				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
@@ -1105,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
-		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+		pci->iommu_table = iommu_init_table(tbl, pci->phb->node,
+				&iommu_table_lpar_multi_ops);
 		iommu_register_group(tbl, NULL, NULL,
 				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
@@ -1297,22 +1314,11 @@ void iommu_init_early_pSeries(void)
 		return;
 
 	if (firmware_has_feature(FW_FEATURE_LPAR)) {
-		if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
-			ppc_md.tce_build = tce_buildmulti_pSeriesLP;
-			ppc_md.tce_free	 = tce_freemulti_pSeriesLP;
-		} else {
-			ppc_md.tce_build = tce_build_pSeriesLP;
-			ppc_md.tce_free	 = tce_free_pSeriesLP;
-		}
-		ppc_md.tce_get   = tce_get_pSeriesLP;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
 		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
 		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
 	} else {
-		ppc_md.tce_build = tce_build_pSeries;
-		ppc_md.tce_free  = tce_free_pSeries;
-		ppc_md.tce_get   = tce_get_pseries;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeries;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeries;
 	}
@@ -1330,8 +1336,6 @@ static int __init disable_multitce(char *str)
 	    firmware_has_feature(FW_FEATURE_LPAR) &&
 	    firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
-		ppc_md.tce_build = tce_build_pSeriesLP;
-		ppc_md.tce_free	 = tce_free_pSeriesLP;
 		powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
 	}
 	return 1;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 9e5353f..27721f5 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
 	return 0;
 }
 
+static struct iommu_table_ops iommu_dart_ops = {
+	.set = dart_build,
+	.clear = dart_free,
+	.flush = dart_flush,
+};
+
 static void iommu_table_dart_setup(void)
 {
 	iommu_table_dart.it_busno = 0;
@@ -298,7 +304,7 @@ static void iommu_table_dart_setup(void)
 	iommu_table_dart.it_base = (unsigned long)dart_vbase;
 	iommu_table_dart.it_index = 0;
 	iommu_table_dart.it_blocksize = 1;
-	iommu_init_table(&iommu_table_dart, -1);
+	iommu_init_table(&iommu_table_dart, -1, &iommu_dart_ops);
 
 	/* Reserve the last page of the DART to avoid possible prefetch
 	 * past the DART mapped area
@@ -386,11 +392,6 @@ void __init iommu_init_early_dart(void)
 	if (dart_init(dn) != 0)
 		goto bail;
 
-	/* Setup low level TCE operations for the core IOMMU code */
-	ppc_md.tce_build = dart_build;
-	ppc_md.tce_free  = dart_free;
-	ppc_md.tce_flush = dart_flush;
-
 	/* Setup bypass if supported */
 	if (dart_is_u4)
 		ppc_md.dma_set_mask = dart_dma_set_mask;
-- 
2.0.0



* [PATCH v2 07/13] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 08/13] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

Normally, a bitmap in the iommu_table is used to track which TCE entries
are in use. Since we are going to use the iommu_table without its locks
and do xchg() instead, it becomes essential not to set permission bits
that are not implied by the direction flag.
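
For illustration only (helper name hypothetical, not part of this patch),
the mapping below can be read as a helper whose key property is that
DMA_NONE yields zero permission bits, so a cleared entry reads back as
unused under xchg()-based tracking:

static unsigned long tce_perms(enum dma_data_direction direction)
{
	switch (direction) {
	case DMA_BIDIRECTIONAL:
	case DMA_FROM_DEVICE:
		return TCE_PCI_READ | TCE_PCI_WRITE; /* device may read and write */
	case DMA_TO_DEVICE:
		return TCE_PCI_READ;	/* device may only read memory */
	default:
		return 0;		/* DMA_NONE: no access bits at all */
	}
}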

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index deddcad..ab79e2d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	__be64 *tcep, *tces;
 	u64 rpn;
 
-	proto_tce = TCE_PCI_READ; // Read allowed
-
-	if (direction != DMA_TO_DEVICE)
-		proto_tce |= TCE_PCI_WRITE;
+	switch (direction) {
+	case DMA_BIDIRECTIONAL:
+	case DMA_FROM_DEVICE:
+		proto_tce = TCE_PCI_READ | TCE_PCI_WRITE;
+		break;
+	case DMA_TO_DEVICE:
+		proto_tce = TCE_PCI_READ;
+		break;
+	default:
+		proto_tce = 0;
+		break;
+	}
 
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
 	rpn = __pa(uaddr) >> tbl->it_page_shift;
-- 
2.0.0



* [PATCH v2 08/13] powerpc/powernv: Release replaced TCE
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 07/13] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

At the moment writing a new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch addresses is the use of pool locks for
external IOMMU users such as VFIO. The pool locks protect the DMA page
allocator rather than the entries, and since the host kernel does not
control which pages are in use, there is no point in taking pool locks;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() but also returns the replaced TCE(s) so the caller can
release the pages afterwards.

This makes iommu_tce_build() put pages returned by exchange().

This replaces iommu_clear_tce() with iommu_tce_build(), which can now
call exchange() with TCE==0 (i.e. clear).

This preserves permission bits in TCE in iommu_put_tce_user_mode().

This removes the use of pool locks for external IOMMU users.

This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not
implement the exchange() callback. Therefore the "powernv" platform is
the only supported one after this patch.
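
For background, the lock-free argument rests on xchg() handing each old
entry to exactly one updater; a minimal model of the update (locals
hypothetical, TCE page size == PAGE_SIZE assumed for brevity):

	__be64 *tcep;	/* points at the TCE being updated */
	unsigned long old;

	old = be64_to_cpu(xchg(tcep, cpu_to_be64(new_tce)));
	if (old & (TCE_PCI_READ | TCE_PCI_WRITE)) {
		/* only one concurrent updater observes this old value */
		struct page *pg = pfn_to_page(old >> PAGE_SHIFT);

		if (old & TCE_PCI_WRITE)
			SetPageDirty(pg);
		put_page(pg);	/* each replaced page released exactly once */
	}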

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* added missing __pa() for TCE which was read from the table

---
 arch/powerpc/include/asm/iommu.h     |  8 +++--
 arch/powerpc/kernel/iommu.c          | 62 ++++++++++++------------------------
 arch/powerpc/platforms/powernv/pci.c | 40 +++++++++++++++++++++++
 3 files changed, 67 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c725e4a..8e0537d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
 			unsigned long uaddr,
 			enum dma_data_direction direction,
 			struct dma_attrs *attrs);
+	int (*exchange)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			unsigned long *old_tces,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
 	unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-		unsigned long entry);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 678fee8..39ccce7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		tbl->it_ops->clear(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-	return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages)
 {
-	unsigned long oldtce;
-	struct page *page;
-
 	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
-			continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
+		iommu_tce_build(tbl, entry, 0, DMA_NONE);
 	}
 
 	return 0;
@@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction)
 {
-	int ret = -EBUSY;
+	int ret;
 	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
 
-	spin_lock(&(pool->lock));
+	ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, &oldtce,
+			direction, NULL);
 
-	oldtce = tbl->it_ops->get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
+	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
+		struct page *pg = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
 
-	spin_unlock(&(pool->lock));
+		if (oldtce & TCE_PCI_WRITE)
+			SetPageDirty(pg);
+		put_page(pg);
+	}
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1111,6 +1080,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 	}
 
 	hwaddr = (unsigned long) page_address(page) + offset;
+	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
 
 	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
 	if (ret)
@@ -1133,6 +1103,16 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0, bit0 = 0;
 
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires exchange() callback defined so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->exchange)
+		return -EINVAL;
+
 	spin_lock_irqsave(&tbl->large_pool.lock, flags);
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index ab79e2d..052e503 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,45 @@ static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
 			false);
 }
 
+static int pnv_tce_xchg_vm(struct iommu_table *tbl, long index,
+			   long npages,
+			   unsigned long uaddr, unsigned long *old_tces,
+			   enum dma_data_direction direction,
+			   struct dma_attrs *attrs)
+{
+	u64 rpn, proto_tce;
+	__be64 *tcep, *tces;
+	long i;
+
+	switch (direction) {
+	case DMA_BIDIRECTIONAL:
+	case DMA_FROM_DEVICE:
+		proto_tce = TCE_PCI_READ | TCE_PCI_WRITE;
+		break;
+	case DMA_TO_DEVICE:
+		proto_tce = TCE_PCI_READ;
+		break;
+	default:
+		proto_tce = 0;
+		break;
+	}
+
+	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+	rpn = __pa(uaddr) >> tbl->it_page_shift;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long oldtce = xchg(tcep, cpu_to_be64(proto_tce |
+				(rpn++ << tbl->it_page_shift)));
+
+		old_tces[i] = (unsigned long) __va(be64_to_cpu(oldtce));
+		tcep++;
+	}
+
+	pnv_tce_invalidate(tbl, tces, tcep - 1, false);
+
+	return 0;
+}
+
 static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
 		bool rm)
 {
@@ -687,6 +726,7 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 
 struct iommu_table_ops pnv_iommu_ops = {
 	.set = pnv_tce_build_vm,
+	.exchange = pnv_tce_xchg_vm,
 	.clear = pnv_tce_free_vm,
 	.get = pnv_tce_get,
 };
-- 
2.0.0



* [PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 08/13] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:00 ` [PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

The previous patch introduced the iommu_table_ops::exchange() callback,
which effectively disabled VFIO on pseries. This implements exchange()
for pseries/lpar so VFIO can work in nested guests.

Since the exchange() callback returns the old TCE, it has to call H_GET_TCE
for every TCE being put into the table, so VFIO performance in guests
running under PR KVM is expected to be slower than in guests running under
HV KVM or on bare metal hosts.
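
In other words, each mapped page now costs two hypercalls instead of one;
the per-entry step mirrors the loop in the hunk below:

	/* learn which TCE we are about to replace ... */
	plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12, &old_tces[i++]);
	/* ... then replace it */
	rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);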

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9a7364f..ae15b5a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
+				unsigned long *old_tces,
 				enum dma_data_direction direction,
 				struct dma_attrs *attrs)
 {
 	u64 rc = 0;
 	u64 proto_tce, tce;
 	u64 rpn;
-	int ret = 0;
+	int ret = 0, i = 0;
 	long tcenum_start = tcenum, npages_start = npages;
 
 	rpn = __pa(uaddr) >> TCE_SHIFT;
@@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	while (npages--) {
 		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+		if (old_tces)
+			plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
+					&old_tces[i++]);
 		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
 		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
@@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
-static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
+				     unsigned long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs)
 {
@@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					   old_tces,
 		                           direction, attrs);
 	}
 
@@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					    old_tces,
 					    direction, attrs);
 		}
 		__get_cpu_var(tce_page) = tcep;
@@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		for (l = 0; l < limit; l++) {
 			tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
 			rpn++;
+			if (old_tces)
+				plpar_tce_get((u64)tbl->it_index,
+						(u64)(tcenum + l) << 12,
+						&old_tces[tcenum + l]);
 		}
 
 		rc = plpar_tce_put_indirect((u64)tbl->it_index,
@@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	return ret;
 }
 
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+				     long npages, unsigned long uaddr,
+				     enum dma_data_direction direction,
+				     struct dma_attrs *attrs)
+{
+	return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL,
+			direction, attrs);
+}
+
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
 	u64 rc;
@@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
 	.set = tce_buildmulti_pSeriesLP,
+	.exchange = tce_xchg_pSeriesLP,
 	.clear = tce_freemulti_pSeriesLP,
 	.get = tce_get_pSeriesLP
 };
-- 
2.0.0



* [PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
@ 2014-09-23  3:00 ` Alexey Kardashevskiy
  2014-09-23  3:01 ` [PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

SPAPR defines an interface to create additional DMA windows dynamically.
"Dynamically" means that the window is not allocated before the guest has
even started; the guest can request it later. In practice, existing Linux
guests check for the capability and, if it is present, create and map
a DMA window as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct (a usage sketch
follows the list):
1. query - ibm,query-pe-dma-window - returns number/size of windows
which can be created (one, any page size);

2. create - ibm,create-pe-dma-window - creates a window;

3. remove - ibm,remove-pe-dma-window - removes a window; removing
the default 32bit window is not allowed by this patch, this will be added
later if needed;

4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
to its default state; as the default window cannot be removed, this only
removes the additional window if one was created.
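
A sketch of how a consumer such as the VFIO driver (added later in this
series) might drive these callbacks; the locals and the chosen shifts are
illustrative only:

	struct iommu_table *tbl = NULL;
	__u32 cur = 0, avail = 0, pgmask = 0;

	if (!data->ops->query(data, &cur, &avail, &pgmask) &&
	    avail && (pgmask & DDW_PGSIZE_64K)) {
		/* 64K TCEs (page_shift 16), 8GB window (window_shift 33) */
		if (!data->ops->create(data, 16, 33, &tbl)) {
			/* ... populate TCEs, run DMA ... */
			data->ops->remove(data, tbl);
		}
	}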

The next patch will add corresponding ioctls to VFIO SPAPR TCE driver to
provide necessary support to the userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/tce.h            |  22 +++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 159 +++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.h      |   1 +
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index e6355f9..23b0362 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops {
 			int num);
 	void (*take_ownership)(struct spapr_tce_iommu_group *data,
 			bool enable);
+
+	/* Dynamic DMA window */
+	/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+	long (*query)(struct spapr_tce_iommu_group *data,
+			__u32 *current_windows,
+			__u32 *windows_available,
+			__u32 *page_size_mask);
+	long (*create)(struct spapr_tce_iommu_group *data,
+			__u32 page_shift,
+			__u32 window_shift,
+			struct iommu_table **ptbl);
+	long (*remove)(struct spapr_tce_iommu_group *data,
+			struct iommu_table *tbl);
+	long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 296f49b..a6318cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		int num)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	switch (num) {
+	case 0:
+		if (pe->tce32.table.it_size)
+			return &pe->tce32.table;
+		return NULL;
+	case 1:
+		if (pe->tce64.table.it_size)
+			return &pe->tce64.table;
+		return NULL;
+	default:
+		return NULL;
+	}
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 				     bool enable)
 {
@@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+		__u32 *current_windows,
+		__u32 *windows_available, __u32 *page_size_mask)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	*windows_available = 2;
+	*current_windows = 0;
+	if (pe->tce32.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	if (pe->tce64.table.it_size) {
+		--*windows_available;
+		++*current_windows;
+	}
+	*page_size_mask =
+		DDW_PGSIZE_4K |
+		DDW_PGSIZE_64K |
+		DDW_PGSIZE_16M;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table **ptbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	struct page *tce_mem = NULL;
+	void *addr;
+	long ret;
+	unsigned long tce_table_size =
+			(1ULL << (window_shift - page_shift)) * 8;
+	unsigned order;
+	struct iommu_table *tbl64 = &pe->tce64.table;
+
+	if ((page_shift != 12) && (page_shift != 16) && (page_shift != 24))
+		return -EINVAL;
+
+	if (window_shift > (memory_hotplug_max() >> page_shift))
+		return -EINVAL;
+
+	if (pe->tce64.table.it_size && pe->tce32.table.it_size)
+		return -EBUSY;
+
+	tce_table_size = max(0x1000UL, tce_table_size);
+	order = get_order(tce_table_size);
+
+	pe_info(pe, "Setting up DDW at %llx..%llx ws=0x%x ps=0x%x table_size=0x%lx order=0x%x\n",
+			pe->tce_bypass_base,
+			pe->tce_bypass_base + (1ULL << window_shift) - 1,
+			window_shift, page_shift, tce_table_size, order);
+
+	tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, order);
+	if (!tce_mem) {
+		pe_err(pe, " Failed to allocate a DDW\n");
+		return -EFAULT;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, tce_table_size);
+
+	/* Configure HW */
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1, /* Window number */
+			1,
+			__pa(addr),
+			tce_table_size,
+			1 << page_shift);
+	if (ret) {
+		pe_err(pe, " Failed to configure 32-bit TCE table, err %ld\n",
+				ret);
+		return -EFAULT;
+	}
+
+	/* Setup linux iommu table */
+	pnv_pci_setup_iommu_table(tbl64, addr, tce_table_size,
+			pe->tce_bypass_base, page_shift);
+	pe->tce64.pe = pe;
+	pe->tce64.invalidate_fn = pnv_pci_ioda2_tce_invalidate;
+
+	/* Copy "invalidate" register address */
+	tbl64->it_index = pe->tce32.table.it_index;
+	tbl64->it_group = pe->tce32.table.it_group;
+	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
+			TCE_PCI_SWINV_PAIR;
+	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
+	tbl64->it_ops = pe->tce32.table.it_ops;
+
+	*ptbl = tbl64;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_remove(struct spapr_tce_iommu_group *data,
+		struct iommu_table *tbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	long ret;
+
+	/* Only additional 64bit window removal is supported */
+	if ((tbl != &pe->tce64.table) || !pe->tce64.table.it_size)
+		return -EFAULT;
+
+	pe_info(pe, "Removing huge 64bit DMA window\n");
+
+	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret)
+		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+
+	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+	memset(&pe->tce64, 0, sizeof(pe->tce64));
+
+	return ret;
+}
+
+static long pnv_pci_ioda2_ddw_reset(struct spapr_tce_iommu_group *data)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pe_info(pe, "Reset DMA windows\n");
+	return pnv_pci_ioda2_ddw_remove(data, &pe->tce64.table);
+}
+
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
-	.get_table = pnv_ioda1_iommu_get_table,
+	.get_table = pnv_ioda2_iommu_get_table,
 	.take_ownership = pnv_ioda2_take_ownership,
+	.query = pnv_pci_ioda2_ddw_query,
+	.create = pnv_pci_ioda2_ddw_create,
+	.remove = pnv_pci_ioda2_ddw_remove,
+	.reset = pnv_pci_ioda2_ddw_reset
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index cf68c4b..9941800 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -66,6 +66,7 @@ struct pnv_ioda_pe {
 	int			tce32_segcount;
 	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
+	struct pnv_iommu_table	tce64;
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
-- 
2.0.0



* [PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2014-09-23  3:00 ` [PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
@ 2014-09-23  3:01 ` Alexey Kardashevskiy
  2014-09-23  3:01 ` [PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
  2014-09-23  3:01 ` [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:01 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This moves the locked-pages accounting into helpers.
Later they will be reused for Dynamic DMA windows (DDW).

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM.
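
As a worked example of the arithmetic these helpers use (sketch; a 2GB
window of 4K IOMMU pages on a host with 64K PAGE_SIZE):

	/* it_size = 2GB / 4K = 0x80000 TCEs */
	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
	/* = (0x80000 << 12) >> 16 = 0x8000 system pages, i.e. the full 2GB */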

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 1c1a9c4..c9fac97 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
+static long try_increment_locked_vm(struct iommu_table *tbl)
+{
+	long ret = 0, locked, lock_limit, npages;
+
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
+	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+				rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+static void decrement_locked_vm(struct iommu_table *tbl)
+{
+	long npages;
+
+	if (!current || !current->mm)
+		return; /* process exited */
+
+	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+	if (npages > current->mm->locked_vm)
+		npages = current->mm->locked_vm;
+	current->mm->locked_vm -= npages;
+	up_write(&current->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked, lock_limit, npages;
 	struct iommu_table *tbl;
 	struct spapr_tce_iommu_group *data;
 
@@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
+	 *
+	 * Unfortunately at the moment it counts whole tables, no matter how
+	 * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+	 * each with 2GB DMA window, 8GB will be counted here. The reason for
+	 * this is that we cannot tell here the amount of RAM used by the guest
+	 * as this information is only available from KVM and VFIO is
+	 * KVM agnostic.
 	 */
 	tbl = data->ops->get_table(data, 0);
 	if (!tbl)
 		return -ENXIO;
 
-	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	locked = current->mm->locked_vm + npages;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-				rlimit(RLIMIT_MEMLOCK));
-		ret = -ENOMEM;
-	} else {
-		current->mm->locked_vm += npages;
-		container->enabled = true;
-	}
-	up_write(&current->mm->mmap_sem);
+	ret = try_increment_locked_vm(tbl);
+	if (ret)
+		return ret;
+
+	container->enabled = true;
 
 	return ret;
 }
@@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!tbl)
 		return;
 
-	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	up_write(&current->mm->mmap_sem);
+	decrement_locked_vm(tbl);
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0



* [PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2014-09-23  3:01 ` [PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers Alexey Kardashevskiy
@ 2014-09-23  3:01 ` Alexey Kardashevskiy
  2014-09-23  3:01 ` [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
  12 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:01 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This makes use of it_page_size from the iommu_table struct,
as the page size can differ between tables.

This replaces the missing IOMMU_PAGE_SHIFT macro in commented-out debug
code, as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
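
For example, with a 16MB-page DDW table (it_page_shift == 24) the map
path now checks alignment and counts TCEs per the table's own page size
(sketch of the resulting logic; npages is illustrative):

	if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
	    (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
		return -EINVAL;		/* must be 16MB aligned here */
	npages = param.size >> tbl->it_page_shift;	/* TCEs, not 4K pages */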

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c9fac97..0dccbc4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl)
 	if (!current || !current->mm)
 		return -ESRCH; /* process exited */
 
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
 	down_write(&current->mm->mmap_sem);
 	locked = current->mm->locked_vm + npages;
@@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl)
 	if (!current || !current->mm)
 		return; /* process exited */
 
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
 	down_write(&current->mm->mmap_sem);
 	if (npages > current->mm->locked_vm)
@@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;
 
-		if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-				(param.vaddr & ~IOMMU_PAGE_MASK_4K))
-			return -EINVAL;
-
 		/* iova is checked by the IOMMU API */
 		tce = param.vaddr;
 		if (param.flags & VFIO_DMA_MAP_FLAG_READ)
@@ -334,21 +330,25 @@ static long tce_iommu_ioctl(void *iommu_data,
 			return -ENXIO;
 		BUG_ON(!tbl->it_group);
 
+		if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+				(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
+			return -EINVAL;
+
 		ret = iommu_tce_put_param_check(tbl, param.iova, tce);
 		if (ret)
 			return ret;
 
-		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
+		for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) {
 			ret = iommu_put_tce_user_mode(tbl,
-					(param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
+					(param.iova >> tbl->it_page_shift) + i,
 					tce);
 			if (ret)
 				break;
-			tce += IOMMU_PAGE_SIZE_4K;
+			tce += IOMMU_PAGE_SIZE(tbl);
 		}
 		if (ret)
 			iommu_clear_tces_and_put_pages(tbl,
-					param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+					param.iova >> tbl->it_page_shift, i);
 
 		iommu_flush_tce(tbl);
 
@@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;
 
-		if (param.size & ~IOMMU_PAGE_MASK_4K)
-			return -EINVAL;
-
 		tbl = spapr_tce_find_table(container, data, param.iova);
 		if (!tbl)
 			return -ENXIO;
 
+		if (param.size & ~IOMMU_PAGE_MASK(tbl))
+			return -EINVAL;
+
 		BUG_ON(!tbl->it_group);
 
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.size >> tbl->it_page_shift);
 		if (ret)
 			return ret;
 
 		ret = iommu_clear_tces_and_put_pages(tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				param.size >> tbl->it_page_shift);
 		iommu_flush_tce(tbl);
 
 		return ret;
-- 
2.0.0



* [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
  2014-09-23  3:00 [PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2014-09-23  3:01 ` [PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
@ 2014-09-23  3:01 ` Alexey Kardashevskiy
  2014-09-23 21:56   ` Alex Williamson
  12 siblings, 1 reply; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-23  3:01 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Alex Williamson,
	Gavin Shan, linux-kernel, kvm

This defines and implements a VFIO IOMMU API which lets userspace
create and remove DMA windows.

This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
available windows and the supported page size mask.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
to allow userspace to create and remove windows.

The VFIO IOMMU driver does basic sanity checks and calls the
corresponding SPAPR TCE functions. At the moment only IODA2 (the POWER8
PCI host bridge) implements them.

This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
VFIO_IOMMU_SPAPR_TCE_GET_INFO.

This calls the platform DDW reset() callback when the IOMMU is being
disabled, to reset the DMA configuration to its original state.
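
A minimal userspace sketch of the new interface (assumes a vfio.h with
the uapi additions below; "container" is an open VFIO container fd with
a group attached and VFIO_SPAPR_TCE_IOMMU set; error handling elided):

	struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
	struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
	struct vfio_iommu_spapr_tce_remove remove = { .argsz = sizeof(remove) };

	ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
	if ((info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW) &&
	    info.windows_available &&
	    (info.page_size_mask & 0x02 /* DDW_PGSIZE_64K, kernel-side flag */)) {
		create.page_shift = 16;		/* 64K TCEs */
		create.window_shift = 33;	/* 8GB of DMA space */
		if (!ioctl(container, VFIO_IOMMU_SPAPR_TCE_CREATE, &create)) {
			/* DMA-map via VFIO_IOMMU_MAP_DMA at iova = create.start_addr ... */
			remove.start_addr = create.start_addr;
			ioctl(container, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
		}
	}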

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++++++++++++++++++++++++++++++++++--
 include/uapi/linux/vfio.h           |  25 ++++++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0dccbc4..b518891 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!container->grp)
 		return;
 
 	data = iommu_group_get_iommudata(container->grp);
 	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return;
 
-	tbl = data->ops->get_table(data, 0);
-	if (!tbl)
-		return;
+	if (current->mm) {
+		tbl = data->ops->get_table(data, 0);
+		if (tbl)
+			decrement_locked_vm(tbl);
 
-	decrement_locked_vm(tbl);
+		tbl = data->ops->get_table(data, 1);
+		if (tbl)
+			decrement_locked_vm(tbl);
+	}
+
+	if (data->ops->reset)
+		data->ops->reset(data);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
 	struct tce_container *container = iommu_data;
-	unsigned long minsz;
+	unsigned long minsz, ddwsz;
 	long ret;
 
 	switch (cmd) {
@@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
 		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
 
+		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				page_size_mask);
+
+		if (info.argsz == ddwsz) {
+			if (data->ops->query && data->ops->create &&
+					data->ops->remove) {
+				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
+
+				ret = data->ops->query(data,
+						&info.current_windows,
+						&info.windows_available,
+						&info.page_size_mask);
+				if (ret)
+					return ret;
+			} else {
+				info.current_windows = 0;
+				info.windows_available = 0;
+				info.page_size_mask = 0;
+			}
+			minsz = ddwsz;
+		}
+
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
 
@@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
+
 	case VFIO_EEH_PE_OP:
 		if (!container->grp)
 			return -ENODEV;
 
 		return vfio_spapr_iommu_eeh_ioctl(container->grp,
 						  cmd, arg);
+
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (create.flags)
+			return -EINVAL;
+
+		if (!data->ops->create || !data->iommu_owner)
+			return -ENOSYS;
+
+		BUG_ON(!data || !data->ops || !data->ops->remove);
+
+		ret = data->ops->create(data, create.page_shift,
+				create.window_shift, &tbl);
+		if (ret)
+			return ret;
+
+		ret = try_increment_locked_vm(tbl);
+		if (ret) {
+			data->ops->remove(data, tbl);
+			return ret;
+		}
+
+		create.start_addr = tbl->it_offset << tbl->it_page_shift;
+
+		if (copy_to_user((void __user *)arg, &create, minsz)) {
+			data->ops->remove(data, tbl);
+			decrement_locked_vm(tbl);
+			return -EFAULT;
+		}
+		mutex_lock(&container->lock);
+		++container->windows_num;
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+		struct vfio_iommu_spapr_tce_remove remove;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+				start_addr);
+
+		if (copy_from_user(&remove, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (remove.argsz < minsz)
+			return -EINVAL;
+
+		if (remove.flags)
+			return -EINVAL;
+
+		if (!data->ops->remove || !data->iommu_owner)
+			return -ENOSYS;
+
+		tbl = spapr_tce_find_table(container, data, remove.start_addr);
+		if (!tbl)
+			return -EINVAL;
+
+		ret = data->ops->remove(data, tbl);
+		if (ret)
+			return ret;
+
+		decrement_locked_vm(tbl);
+
+		mutex_lock(&container->lock);
+		--container->windows_num;
+		mutex_unlock(&container->lock);
+
+		return 0;
+	}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 6612974..e71a6ef 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -451,9 +451,13 @@ struct vfio_iommu_type1_dma_unmap {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	__u32 current_windows;
+	__u32 windows_available;
+	__u32 page_size_mask;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -489,6 +493,25 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.0.0



* Re: [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  2014-09-23  3:00 ` [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
@ 2014-09-23 20:42   ` Alex Williamson
  2014-09-25  5:19     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2014-09-23 20:42 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Gavin Shan, linux-kernel, kvm

On Tue, 2014-09-23 at 13:00 +1000, Alexey Kardashevskiy wrote:
> Modern IBM POWERPC systems support multiple IOMMU tables per PE
> so we need a more reliable way (compared to container_of()) to get
> a PE pointer from the iommu_table struct pointer used in IOMMU functions.
> 
> At the moment IOMMU group data points to an iommu_table struct. This
> introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
> and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
> the pnv_ioda_pe struct, for others it is still a pointer to
> the iommu_table struct. The ops structs correspond to the type which
> iommu_owner points to.
> 
> This defines a get_table() callback which returns an iommu_table
> by its number.
> 
> As the IOMMU group data pointer points to variable type instead of
> iommu_table, VFIO SPAPR TCE driver is updated to use the new type.
> This changes the tce_container struct to store iommu_group instead of
> iommu_table.
> 
> So, it was:
> - iommu_table points to iommu_group via iommu_table::it_group;
> - iommu_group points to iommu_table via iommu_group_get_iommudata();
> 
> now it is:
> - iommu_table points to iommu_group via iommu_table::it_group;
> - iommu_group points to spapr_tce_iommu_group via
> iommu_group_get_iommudata();
> - spapr_tce_iommu_group points to either (depending on .get_table()):
> 	- iommu_table;
> 	- pnv_ioda_pe;
> 
> This uses pnv_ioda1_iommu_get_table for both IODA1&2 but IODA2 will
> have its own pnv_ioda2_iommu_get_table soon and pnv_ioda1_iommu_get_table
> will only be used for IODA1.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h            |   6 ++
>  arch/powerpc/include/asm/tce.h              |  13 +++
>  arch/powerpc/kernel/iommu.c                 |  35 ++++++-
>  arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +++++-
>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>  arch/powerpc/platforms/pseries/iommu.c      |  10 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c         | 148 ++++++++++++++++++++++------
>  8 files changed, 208 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 42632c7..84ee339 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>   */
>  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>  					    int nid);
> +
> +struct spapr_tce_iommu_ops;
>  #ifdef CONFIG_IOMMU_API
>  extern void iommu_register_group(struct iommu_table *tbl,
> +				 void *iommu_owner,
> +				 struct spapr_tce_iommu_ops *ops,
>  				 int pci_domain_number, unsigned long pe_num);
>  extern int iommu_add_device(struct device *dev);
>  extern void iommu_del_device(struct device *dev);
>  #else
>  static inline void iommu_register_group(struct iommu_table *tbl,
> +					void *iommu_owner,
> +					struct spapr_tce_iommu_ops *ops,
>  					int pci_domain_number,
>  					unsigned long pe_num)
>  {
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index 743f36b..9f159eb 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -50,5 +50,18 @@
>  #define TCE_PCI_READ		0x1		/* read from PCI allowed */
>  #define TCE_VB_WRITE		0x1		/* write from VB allowed */
>  
> +struct spapr_tce_iommu_group;
> +
> +struct spapr_tce_iommu_ops {
> +	struct iommu_table *(*get_table)(
> +			struct spapr_tce_iommu_group *data,
> +			int num);
> +};
> +
> +struct spapr_tce_iommu_group {
> +	void *iommu_owner;
> +	struct spapr_tce_iommu_ops *ops;
> +};
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_TCE_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index b378f78..1c5dae7 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   */
>  static void group_release(void *iommu_data)
>  {
> -	struct iommu_table *tbl = iommu_data;
> -	tbl->it_group = NULL;
> +	kfree(iommu_data);
>  }
>  
> +static struct iommu_table *spapr_tce_default_get_table(
> +		struct spapr_tce_iommu_group *data, int num)
> +{
> +	struct iommu_table *tbl = data->iommu_owner;
> +
> +	switch (num) {
> +	case 0:
> +		if (tbl->it_size)
> +			return tbl;
> +		/* fallthru */
> +	default:
> +		return NULL;
> +	}
> +}
> +
> +static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
> +	.get_table = spapr_tce_default_get_table
> +};
> +
>  void iommu_register_group(struct iommu_table *tbl,
> +		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
>  		int pci_domain_number, unsigned long pe_num)
>  {
>  	struct iommu_group *grp;
>  	char *name;
> +	struct spapr_tce_iommu_group *data;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return;
> +
> +	data->iommu_owner = iommu_owner ? iommu_owner : tbl;
> +	data->ops = ops ? ops : &spapr_tce_default_ops;
>  
>  	grp = iommu_group_alloc();
>  	if (IS_ERR(grp)) {
>  		pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
>  				PTR_ERR(grp));
> +		kfree(data);
>  		return;
>  	}
> +
>  	tbl->it_group = grp;
> -	iommu_group_set_iommudata(grp, tbl, group_release);
> +	iommu_group_set_iommudata(grp, data, group_release);
>  	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
>  			pci_domain_number, pe_num);
>  	if (!name)
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 136e765..2d32a1c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -23,6 +23,7 @@
>  #include <linux/io.h>
>  #include <linux/msi.h>
>  #include <linux/memblock.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/sections.h>
>  #include <asm/io.h>
> @@ -988,6 +989,26 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>  	}
>  }
>  
> +static struct iommu_table *pnv_ioda1_iommu_get_table(
> +		struct spapr_tce_iommu_group *data,
> +		int num)
> +{
> +	struct pnv_ioda_pe *pe = data->iommu_owner;
> +
> +	switch (num) {
> +	case 0:
> +		if (pe->tce32.table.it_size)
> +			return &pe->tce32.table;
> +		/* fallthru */
> +	default:
> +		return NULL;
> +	}
> +}
> +
> +static struct spapr_tce_iommu_ops pnv_pci_ioda1_ops = {
> +	.get_table = pnv_ioda1_iommu_get_table,
> +};
> +
>  static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  				      struct pnv_ioda_pe *pe, unsigned int base,
>  				      unsigned int segs)
> @@ -1067,7 +1088,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  				 TCE_PCI_SWINV_PAIR);
>  	}
>  	iommu_init_table(tbl, phb->hose->node);
> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
> +	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
> +			phb->hose->global_number, pe->pe_number);
>  
>  	if (pe->pdev)
>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> @@ -1137,6 +1159,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>  	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
>  }
>  
> +static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
> +	.get_table = pnv_ioda1_iommu_get_table,
> +};
> +
>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  				       struct pnv_ioda_pe *pe)
>  {
> @@ -1202,7 +1228,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>  	}
>  	iommu_init_table(tbl, phb->hose->node);
> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
> +	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
> +			phb->hose->global_number, pe->pe_number);
>  
>  	if (pe->pdev)
>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> index 94ce348..b79066d 100644
> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
> @@ -89,6 +89,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
>  	if (phb->p5ioc2.iommu_table.it_map == NULL) {
>  		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
>  		iommu_register_group(&phb->p5ioc2.iommu_table,
> +				NULL, NULL,
>  				pci_domain_nr(phb->hose->bus), phb->opal_id);
>  	}
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 97895d4..6ffac79 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -723,7 +723,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
>  	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
>  				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
>  	iommu_init_table(tbl, hose->node);
> -	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
> +	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
>  
>  	/* Deal with SW invalidated TCEs when needed (BML way) */
>  	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 4642d6a..b95f8cf 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -616,7 +616,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
>  
>  	iommu_table_setparms(pci->phb, dn, tbl);
>  	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> -	iommu_register_group(tbl, pci_domain_nr(bus), 0);
> +	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
>  
>  	/* Divide the rest (1.75GB) among the children */
>  	pci->phb->dma_window_size = 0x80000000ul;
> @@ -661,7 +661,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
>  				   ppci->phb->node);
>  		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
>  		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
> -		iommu_register_group(tbl, pci_domain_nr(bus), 0);
> +		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
>  		pr_debug("  created table: %p\n", ppci->iommu_table);
>  	}
>  }
> @@ -688,7 +688,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
>  				   phb->node);
>  		iommu_table_setparms(phb, dn, tbl);
>  		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
> -		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
> +		iommu_register_group(tbl, NULL, NULL,
> +				pci_domain_nr(phb->bus), 0);
>  		set_iommu_table_base_and_group(&dev->dev,
>  					       PCI_DN(dn)->iommu_table);
>  		return;
> @@ -1105,7 +1106,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>  				   pci->phb->node);
>  		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
>  		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
> -		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
> +		iommu_register_group(tbl, NULL, NULL,
> +				pci_domain_nr(pci->phb->bus), 0);
>  		pr_debug("  created table: %p\n", pci->iommu_table);
>  	} else {
>  		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 730b4ef..a8adfbd 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -43,17 +43,51 @@ static void tce_iommu_detach_group(void *iommu_data,
>   */
>  struct tce_container {
>  	struct mutex lock;
> -	struct iommu_table *tbl;
> +	struct iommu_group *grp;
> +	long windows_num;
>  	bool enabled;
>  };
>  
> +static struct iommu_table *spapr_tce_find_table(
> +		struct tce_container *container,
> +		struct spapr_tce_iommu_group *data,
> +		phys_addr_t ioba)
> +{
> +	long i;
> +	struct iommu_table *ret = NULL;
> +
> +	mutex_lock(&container->lock);
> +	for (i = 0; i < container->windows_num; ++i) {
> +		struct iommu_table *tbl = data->ops->get_table(data, i);
> +
> +		if (tbl) {
> +			unsigned long entry = ioba >> tbl->it_page_shift;
> +			unsigned long start = tbl->it_offset;
> +			unsigned long end = start + tbl->it_size;
> +
> +			if ((start <= entry) && (entry < end)) {
> +				ret = tbl;
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&container->lock);
> +
> +	return ret;
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
>  	unsigned long locked, lock_limit, npages;
> -	struct iommu_table *tbl = container->tbl;
> +	struct iommu_table *tbl;
> +	struct spapr_tce_iommu_group *data;
>  
> -	if (!container->tbl)
> +	if (!container->grp)
> +		return -ENXIO;
> +
> +	data = iommu_group_get_iommudata(container->grp);
> +	if (!data || !data->iommu_owner || !data->ops->get_table)
>  		return -ENXIO;
>  
>  	if (!current->mm)
> @@ -80,6 +114,10 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 * that would effectively kill the guest at random points, much better
>  	 * enforcing the limit based on the max that the guest can map.
>  	 */
> +	tbl = data->ops->get_table(data, 0);

Can we define an enum to avoid sprinkling magic zeros in the code?
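Something along these lines perhaps (the names here are made up):

	/* Window numbers passed to spapr_tce_iommu_ops::get_table() */
	enum spapr_tce_window_num {
		TCE_DEFAULT_WINDOW = 0,	/* the small 32bit window */
		TCE_DDW_WINDOW = 1,	/* a dynamically created 64bit window */
	};

	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);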

> +	if (!tbl)
> +		return -ENXIO;
> +
>  	down_write(&current->mm->mmap_sem);
>  	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>  	locked = current->mm->locked_vm + npages;
> @@ -89,7 +127,6 @@ static int tce_iommu_enable(struct tce_container *container)
>  				rlimit(RLIMIT_MEMLOCK));
>  		ret = -ENOMEM;
>  	} else {
> -
>  		current->mm->locked_vm += npages;
>  		container->enabled = true;
>  	}
> @@ -100,16 +137,27 @@ static int tce_iommu_enable(struct tce_container *container)
>  
>  static void tce_iommu_disable(struct tce_container *container)
>  {
> +	struct spapr_tce_iommu_group *data;
> +	struct iommu_table *tbl;
> +
>  	if (!container->enabled)
>  		return;
>  
>  	container->enabled = false;
>  
> -	if (!container->tbl || !current->mm)
> +	if (!container->grp || !current->mm)
> +		return;
> +
> +	data = iommu_group_get_iommudata(container->grp);
> +	if (!data || !data->iommu_owner || !data->ops->get_table)
> +		return;
> +
> +	tbl = data->ops->get_table(data, 0);
> +	if (!tbl)
>  		return;
>  
>  	down_write(&current->mm->mmap_sem);
> -	current->mm->locked_vm -= (container->tbl->it_size <<
> +	current->mm->locked_vm -= (tbl->it_size <<
>  			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>  	up_write(&current->mm->mmap_sem);
>  }
> @@ -129,6 +177,8 @@ static void *tce_iommu_open(unsigned long arg)
>  
>  	mutex_init(&container->lock);
>  
> +	container->windows_num = 1;
> +
>  	return container;
>  }
>  
> @@ -136,11 +186,11 @@ static void tce_iommu_release(void *iommu_data)
>  {
>  	struct tce_container *container = iommu_data;
>  
> -	WARN_ON(container->tbl && !container->tbl->it_group);
> +	WARN_ON(container->grp);
>  	tce_iommu_disable(container);
>  
> -	if (container->tbl && container->tbl->it_group)
> -		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
> +	if (container->grp)
> +		tce_iommu_detach_group(iommu_data, container->grp);
>  
>  	mutex_destroy(&container->lock);
>  
> @@ -169,8 +219,17 @@ static long tce_iommu_ioctl(void *iommu_data,
>  
>  	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>  		struct vfio_iommu_spapr_tce_info info;
> -		struct iommu_table *tbl = container->tbl;
> +		struct iommu_table *tbl;
> +		struct spapr_tce_iommu_group *data;
>  
> +		if (WARN_ON(!container->grp))
> +			return -ENXIO;
> +
> +		data = iommu_group_get_iommudata(container->grp);
> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
> +			return -ENXIO;
> +
> +		tbl = data->ops->get_table(data, 0);
>  		if (WARN_ON(!tbl))
>  			return -ENXIO;
>  
> @@ -194,13 +253,16 @@ static long tce_iommu_ioctl(void *iommu_data,
>  	}
>  	case VFIO_IOMMU_MAP_DMA: {
>  		struct vfio_iommu_type1_dma_map param;
> -		struct iommu_table *tbl = container->tbl;
> +		struct iommu_table *tbl;
> +		struct spapr_tce_iommu_group *data;
>  		unsigned long tce, i;
>  
> -		if (!tbl)
> +		if (WARN_ON(!container->grp))

If a user can get here by their own mistake, return an error and be
done, no warning.  If a user can get here via a kernel ordering issue,
it's a problem in the design.  Which is it?
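If this is only reachable by user mistake, the bare error is all we need:

	if (!container->grp)
		return -ENXIO;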

>  			return -ENXIO;
>  
> -		BUG_ON(!tbl->it_group);
> +		data = iommu_group_get_iommudata(container->grp);
> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
> +			return -ENXIO;

Same

>  
>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>  
> @@ -225,6 +287,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
>  			tce |= TCE_PCI_WRITE;
>  
> +		tbl = spapr_tce_find_table(container, data, param.iova);

Why aren't we using ops->find_table() here?

> +		if (!tbl)
> +			return -ENXIO;
> +		BUG_ON(!tbl->it_group);
> +
>  		ret = iommu_tce_put_param_check(tbl, param.iova, tce);
>  		if (ret)
>  			return ret;
> @@ -247,9 +314,14 @@ static long tce_iommu_ioctl(void *iommu_data,
>  	}
>  	case VFIO_IOMMU_UNMAP_DMA: {
>  		struct vfio_iommu_type1_dma_unmap param;
> -		struct iommu_table *tbl = container->tbl;
> +		struct iommu_table *tbl;
> +		struct spapr_tce_iommu_group *data;
>  
> -		if (WARN_ON(!tbl))
> +		if (WARN_ON(!container->grp))
> +			return -ENXIO;
> +
> +		data = iommu_group_get_iommudata(container->grp);
> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
>  			return -ENXIO;

Same as above on both of these.

>  
>  		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
> @@ -268,6 +340,12 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		if (param.size & ~IOMMU_PAGE_MASK_4K)
>  			return -EINVAL;
>  
> +		tbl = spapr_tce_find_table(container, data, param.iova);

ops->find_table()?

> +		if (!tbl)
> +			return -ENXIO;
> +
> +		BUG_ON(!tbl->it_group);
> +
>  		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
>  				param.size >> IOMMU_PAGE_SHIFT_4K);
>  		if (ret)
> @@ -293,10 +371,10 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		mutex_unlock(&container->lock);
>  		return 0;
>  	case VFIO_EEH_PE_OP:
> -		if (!container->tbl || !container->tbl->it_group)
> +		if (!container->grp)
>  			return -ENODEV;
>  
> -		return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
> +		return vfio_spapr_iommu_eeh_ioctl(container->grp,
>  						  cmd, arg);
>  	}
>  
> @@ -308,16 +386,16 @@ static int tce_iommu_attach_group(void *iommu_data,
>  {
>  	int ret;
>  	struct tce_container *container = iommu_data;
> -	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +	struct iommu_table *tbl;
> +	struct spapr_tce_iommu_group *data;
>  
> -	BUG_ON(!tbl);
>  	mutex_lock(&container->lock);
>  
>  	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
>  			iommu_group_id(iommu_group), iommu_group); */
> -	if (container->tbl) {
> +	if (container->grp) {
>  		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
> -				iommu_group_id(container->tbl->it_group),
> +				iommu_group_id(container->grp),
>  				iommu_group_id(iommu_group));
>  		ret = -EBUSY;
>  	} else if (container->enabled) {
> @@ -325,9 +403,16 @@ static int tce_iommu_attach_group(void *iommu_data,
>  				iommu_group_id(iommu_group));
>  		ret = -EBUSY;
>  	} else {
> +		data = iommu_group_get_iommudata(iommu_group);
> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
> +			return -ENXIO;
> +
> +		tbl = data->ops->get_table(data, 0);
> +		BUG_ON(!tbl);

Why BUG_ON?  A user can attempt to attach any group to any container; do
you really want to give them the ability to halt the system?
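Returning an error (so the mutex below is still dropped) would be safer;
an untested sketch:

	tbl = data->ops->get_table(data, 0);
	if (!tbl) {
		ret = -ENXIO;
	} else {
		ret = iommu_take_ownership(tbl);
		if (!ret)
			container->grp = iommu_group;
	}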

> +
>  		ret = iommu_take_ownership(tbl);
>  		if (!ret)
> -			container->tbl = tbl;
> +			container->grp = iommu_group;
>  	}
>  
>  	mutex_unlock(&container->lock);
> @@ -339,24 +424,31 @@ static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group)
>  {
>  	struct tce_container *container = iommu_data;
> -	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +	struct iommu_table *tbl;
> +	struct spapr_tce_iommu_group *data;
>  
> -	BUG_ON(!tbl);
>  	mutex_lock(&container->lock);
> -	if (tbl != container->tbl) {
> +	if (iommu_group != container->grp) {
>  		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
>  				iommu_group_id(iommu_group),
> -				iommu_group_id(tbl->it_group));
> +				iommu_group_id(container->grp));
>  	} else {
>  		if (container->enabled) {
>  			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
> -					iommu_group_id(tbl->it_group));
> +					iommu_group_id(container->grp));
>  			tce_iommu_disable(container);
>  		}
>  
>  		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
>  				iommu_group_id(iommu_group), iommu_group); */
> -		container->tbl = NULL;
> +		container->grp = NULL;
> +
> +		data = iommu_group_get_iommudata(iommu_group);
> +		BUG_ON(!data || !data->iommu_owner || !data->ops);
> +
> +		tbl = data->ops->get_table(data, 0);
> +		BUG_ON(!tbl);
> +
>  		iommu_release_ownership(tbl);
>  	}
>  	mutex_unlock(&container->lock);


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-09-23  3:00 ` [PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
@ 2014-09-23 21:00   ` Alex Williamson
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Williamson @ 2014-09-23 21:00 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Gavin Shan, linux-kernel, kvm

On Tue, 2014-09-23 at 13:00 +1000, Alexey Kardashevskiy wrote:
> At the moment the iommu_table struct has a set_bypass() which enables/
> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> which calls this callback when external IOMMU users such as VFIO are
> about to take over a PHB.
> 
> Since the set_bypass() is not really an iommu_table function but a PE's
> function, and we have an ops struct per IOMMU owner, let's move
> set_bypass() to the spapr_tce_iommu_ops struct.
> 
> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
> has very little to do with PEs, this moves take_ownership() calls to
> the VFIO SPAPR TCE driver.
> 
> This renames set_bypass() to take_ownership() as it is not necessarily
> just enabling bypassing; it can be something else or more, so let's give it
> a generic name. The bool parameter is inverted.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/iommu.h          |  1 -
>  arch/powerpc/include/asm/tce.h            |  2 ++
>  arch/powerpc/kernel/iommu.c               | 12 ------------
>  arch/powerpc/platforms/powernv/pci-ioda.c | 20 ++++++++++++--------
>  drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
>  5 files changed, 30 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 84ee339..2b0b01d 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -77,7 +77,6 @@ struct iommu_table {
>  #ifdef CONFIG_IOMMU_API
>  	struct iommu_group *it_group;
>  #endif
> -	void (*set_bypass)(struct iommu_table *tbl, bool enable);
>  };
>  
>  /* Pure 2^n version of get_order */
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index 9f159eb..e6355f9 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops {
>  	struct iommu_table *(*get_table)(
>  			struct spapr_tce_iommu_group *data,
>  			int num);
> +	void (*take_ownership)(struct spapr_tce_iommu_group *data,
> +			bool enable);

"set" is a better verb when using a bool to specify direction, imho.

This is pretty confusing now that we have

iommu_take_ownership()
data->ops->take_ownership(true)

iommu_release_ownership()
data->ops->take_ownership(false)

And there are zero comments here about what take_ownership is supposed to
provide, or get_table for that matter.
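Even a couple of lines in tce.h would help; my guess at the intended
semantics:

	struct spapr_tce_iommu_ops {
		/* Return the @num'th DMA window of the owner or NULL */
		struct iommu_table *(*get_table)(
				struct spapr_tce_iommu_group *data,
				int num);
		/*
		 * Called when an external user such as VFIO takes
		 * (@enable == true) or returns (@enable == false) control
		 * of the group, so the owner can disable/restore things
		 * like the bypass window.
		 */
		void (*take_ownership)(struct spapr_tce_iommu_group *data,
				bool enable);
	};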

>  };
>  
>  struct spapr_tce_iommu_group {
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 1c5dae7..c2c8d9d 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>  	memset(tbl->it_map, 0xff, sz);
>  	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
>  
> -	/*
> -	 * Disable iommu bypass, otherwise the user can DMA to all of
> -	 * our physical memory via the bypass window instead of just
> -	 * the pages that has been explicitly mapped into the iommu
> -	 */
> -	if (tbl->set_bypass)
> -		tbl->set_bypass(tbl, false);
> -
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
> @@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>  	/* Restore bit#0 set by iommu_init_table() */
>  	if (tbl->it_offset == 0)
>  		set_bit(0, tbl->it_map);
> -
> -	/* The kernel owns the device now, we can restore the iommu bypass */
> -	if (tbl->set_bypass)
> -		tbl->set_bypass(tbl, true);
>  }
>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 2d32a1c..8cb2f31 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>  }
>  
> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>  {
> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> -					      tce32.table);
>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
>  	int64_t rc;
>  
> @@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>  		 * host side.
>  		 */
>  		if (pe->pdev)
> -			set_iommu_table_base(&pe->pdev->dev, tbl);
> +			set_iommu_table_base(&pe->pdev->dev, &pe->tce32.table);
>  		else
>  			pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
>  	}
> @@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>  	/* TVE #1 is selected by PCI address bit 59 */
>  	pe->tce_bypass_base = 1ull << 59;
>  
> -	/* Install set_bypass callback for VFIO */
> -	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
> -
>  	/* Enable bypass by default */
> -	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
> +	pnv_pci_ioda2_set_bypass(pe, true);
> +}
> +
> +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
> +				     bool enable)
> +{
> +	struct pnv_ioda_pe *pe = data->iommu_owner;
> +
> +	pnv_pci_ioda2_set_bypass(pe, !enable);
>  }
>  
>  static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
>  	.get_table = pnv_ioda1_iommu_get_table,
> +	.take_ownership = pnv_ioda2_take_ownership,
>  };
>  
>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a8adfbd..1c1a9c4 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -76,6 +76,13 @@ static struct iommu_table *spapr_tce_find_table(
>  	return ret;
>  }
>  
> +static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
> +		bool enable)
> +{
> +	if (data && data->ops && data->ops->take_ownership)
> +		data->ops->take_ownership(data, enable);
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> @@ -413,6 +420,12 @@ static int tce_iommu_attach_group(void *iommu_data,
>  		ret = iommu_take_ownership(tbl);
>  		if (!ret)
>  			container->grp = iommu_group;
> +		/*
> +		 * Disable iommu bypass, otherwise the user can DMA to all of
> +		 * our physical memory via the bypass window instead of just
> +		 * the pages that has been explicitly mapped into the iommu
> +		 */
> +		tce_iommu_take_ownership_notify(data, true);
>  	}
>  
>  	mutex_unlock(&container->lock);
> @@ -450,6 +463,9 @@ static void tce_iommu_detach_group(void *iommu_data,
>  		BUG_ON(!tbl);
>  
>  		iommu_release_ownership(tbl);
> +
> +		/* Kernel owns the device now, we can restore bypass */
> +		tce_iommu_take_ownership_notify(data, false);
>  	}
>  	mutex_unlock(&container->lock);
>  }




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
  2014-09-23  3:01 ` [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
@ 2014-09-23 21:56   ` Alex Williamson
  2014-10-10 18:33     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2014-09-23 21:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Gavin Shan, linux-kernel, kvm

On Tue, 2014-09-23 at 13:01 +1000, Alexey Kardashevskiy wrote:
> This defines and implements a VFIO IOMMU API which lets userspace
> create and remove DMA windows.
> 
> This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
> available windows and page mask.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
> to allow userspace to create and remove window(s).
> 
> The VFIO IOMMU driver does basic sanity checks and calls corresponding
> SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
> implements them.
> 
> This advertises the VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
> VFIO_IOMMU_SPAPR_TCE_GET_INFO.
> 
> This calls the platform DDW reset() callback when the IOMMU is being disabled
> to reset the DMA configuration to its original state.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/vfio.h           |  25 ++++++-
>  2 files changed, 153 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 0dccbc4..b518891 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
>  
>  	container->enabled = false;
>  
> -	if (!container->grp || !current->mm)
> +	if (!container->grp)
>  		return;
>  
>  	data = iommu_group_get_iommudata(container->grp);
>  	if (!data || !data->iommu_owner || !data->ops->get_table)
>  		return;
>  
> -	tbl = data->ops->get_table(data, 0);
> -	if (!tbl)
> -		return;
> +	if (current->mm) {
> +		tbl = data->ops->get_table(data, 0);
> +		if (tbl)
> +			decrement_locked_vm(tbl);
>  
> -	decrement_locked_vm(tbl);
> +		tbl = data->ops->get_table(data, 1);
> +		if (tbl)
> +			decrement_locked_vm(tbl);
> +	}
> +
> +	if (data->ops->reset)
> +		data->ops->reset(data);
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)
> @@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>  				 unsigned int cmd, unsigned long arg)
>  {
>  	struct tce_container *container = iommu_data;
> -	unsigned long minsz;
> +	unsigned long minsz, ddwsz;
>  	long ret;
>  
>  	switch (cmd) {
> @@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
>  		info.flags = 0;
>  
> +		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> +				page_size_mask);
> +
> +		if (info.argsz == ddwsz) {

>=

> +			if (data->ops->query && data->ops->create &&
> +					data->ops->remove) {
> +				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;

I think you want to set this flag regardless of whether the user has
provided space for it.  A valid use model is to call with the minimum
size and look at the flags to determine if it needs to be called again
with a larger size.
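Roughly this flow on the userspace side (pseudo-code, error handling
omitted):

	struct vfio_iommu_spapr_tce_info info;

	memset(&info, 0, sizeof(info));
	/* first call with the pre-DDW size */
	info.argsz = offsetofend(struct vfio_iommu_spapr_tce_info,
			dma32_window_size);
	ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);

	if (info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW) {
		/* call again with room for the DDW fields */
		info.argsz = sizeof(info);
		ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
	}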

> +
> +				ret = data->ops->query(data,
> +						&info.current_windows,
> +						&info.windows_available,
> +						&info.page_size_mask);
> +				if (ret)
> +					return ret;
> +			} else {
> +				info.current_windows = 0;
> +				info.windows_available = 0;
> +				info.page_size_mask = 0;
> +			}
> +			minsz = ddwsz;

It's not really any longer the min size, is it?

> +		}
> +
>  		if (copy_to_user((void __user *)arg, &info, minsz))
>  			return -EFAULT;
>  
> @@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		tce_iommu_disable(container);
>  		mutex_unlock(&container->lock);
>  		return 0;
> +
>  	case VFIO_EEH_PE_OP:
>  		if (!container->grp)
>  			return -ENODEV;
>  
>  		return vfio_spapr_iommu_eeh_ioctl(container->grp,
>  						  cmd, arg);
> +
> +	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
> +		struct vfio_iommu_spapr_tce_create create;
> +		struct spapr_tce_iommu_group *data;
> +		struct iommu_table *tbl;
> +
> +		if (WARN_ON(!container->grp))

redux previous comment on this warning

> +			return -ENXIO;
> +
> +		data = iommu_group_get_iommudata(container->grp);
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
> +				start_addr);
> +
> +		if (copy_from_user(&create, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (create.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (create.flags)
> +			return -EINVAL;
> +
> +		if (!data->ops->create || !data->iommu_owner)
> +			return -ENOSYS;
> +
> +		BUG_ON(!data || !data->ops || !data->ops->remove);

A little late for this test since we'll oops on the previous test.  Why is
this a BUG_ON?  A user could exploit this on a system with only a
partial set of callbacks.
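Checking the whole chain up front and failing with a plain error looks
safer, e.g.:

	data = iommu_group_get_iommudata(container->grp);
	if (!data || !data->ops || !data->ops->create ||
	    !data->ops->remove || !data->iommu_owner)
		return -ENOSYS;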

> +
> +		ret = data->ops->create(data, create.page_shift,
> +				create.window_shift, &tbl);
> +		if (ret)
> +			return ret;
> +
> +		ret = try_increment_locked_vm(tbl);
> +		if (ret) {
> +			data->ops->remove(data, tbl);
> +			return ret;
> +		}
> +
> +		create.start_addr = tbl->it_offset << tbl->it_page_shift;
> +
> +		if (copy_to_user((void __user *)arg, &create, minsz)) {
> +			data->ops->remove(data, tbl);
> +			decrement_locked_vm(tbl);
> +			return -EFAULT;
> +		}
> +		mutex_lock(&container->lock);
> +		++container->windows_num;
> +		mutex_unlock(&container->lock);
> +
> +		return ret;
> +	}
> +	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
> +		struct vfio_iommu_spapr_tce_remove remove;
> +		struct spapr_tce_iommu_group *data;
> +		struct iommu_table *tbl;
> +
> +		if (WARN_ON(!container->grp))
> +			return -ENXIO;
> +
> +		data = iommu_group_get_iommudata(container->grp);
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
> +				start_addr);
> +
> +		if (copy_from_user(&remove, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (remove.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (remove.flags)
> +			return -EINVAL;
> +
> +		if (!data->ops->remove || !data->iommu_owner)

On this one we don't bother to get data/data->ops.  Is there also an
exploit where the user can call these CREATE/REMOVE interfaces even
though INFO doesn't expose them if only a partial set of callbacks are
present?

> +			return -ENOSYS;
> +
> +		tbl = spapr_tce_find_table(container, data, remove.start_addr);

What happens if this returns the 0 index rather than the expected 1
index table?  Why doesn't this call ops->find_table()?

> +		if (!tbl)
> +			return -EINVAL;
> +
> +		ret = data->ops->remove(data, tbl);
> +		if (ret)
> +			return ret;
> +
> +		decrement_locked_vm(tbl);
> +
> +		mutex_lock(&container->lock);
> +		--container->windows_num;
> +		mutex_unlock(&container->lock);
> +
> +		return 0;
> +	}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 6612974..e71a6ef 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -451,9 +451,13 @@ struct vfio_iommu_type1_dma_unmap {
>   */
>  struct vfio_iommu_spapr_tce_info {
>  	__u32 argsz;
> -	__u32 flags;			/* reserved for future use */
> +	__u32 flags;
> +#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
>  	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
>  	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
> +	__u32 current_windows;
> +	__u32 windows_available;
> +	__u32 page_size_mask;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> @@ -489,6 +493,25 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +struct vfio_iommu_spapr_tce_create {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u32 page_shift;
> +	__u32 window_shift;
> +	/* out */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +
> +struct vfio_iommu_spapr_tce_remove {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
> +

Zero comments, no good.
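Something brief in the style of the rest of this header would do, e.g.:

	/*
	 * VFIO_IOMMU_SPAPR_TCE_CREATE
	 *
	 * Creates an additional TCE table (a DMA window) and returns where
	 * it starts on the PCI bus.
	 */
	struct vfio_iommu_spapr_tce_create {
		__u32 argsz;
		__u32 flags;
		/* in */
		__u32 page_shift;	/* TCE page size is 1 << page_shift */
		__u32 window_shift;	/* window size is 1 << window_shift */
		/* out */
		__u64 start_addr;	/* bus address of the new window */
	};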

>  /* ***************************************************************** */
>  
>  #endif /* _UAPIVFIO_H */




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  2014-09-23 20:42   ` Alex Williamson
@ 2014-09-25  5:19     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-09-25  5:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Gavin Shan, linux-kernel, kvm

On 09/24/2014 06:42 AM, Alex Williamson wrote:
> On Tue, 2014-09-23 at 13:00 +1000, Alexey Kardashevskiy wrote:
>> Modern IBM POWERPC systems support multiple IOMMU tables per PE
>> so we need a more reliable way (compared to container_of()) to get
>> a PE pointer from the iommu_table struct pointer used in IOMMU functions.
>>
>> At the moment IOMMU group data points to an iommu_table struct. This
>> introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
>> and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
>> the pnv_ioda_pe struct, for others it is still a pointer to
>> the iommu_table struct. The ops structs correspond to the type which
>> iommu_owner points to.
>>
>> This defines a get_table() callback which returns an iommu_table
>> by its number.
>>
>> As the IOMMU group data pointer points to a variable type instead of
>> iommu_table, VFIO SPAPR TCE driver is updated to use the new type.
>> This changes the tce_container struct to store iommu_group instead of
>> iommu_table.
>>
>> So, it was:
>> - iommu_table points to iommu_group via iommu_table::it_group;
>> - iommu_group points to iommu_table via iommu_group_get_iommudata();
>>
>> now it is:
>> - iommu_table points to iommu_group via iommu_table::it_group;
>> - iommu_group points to spapr_tce_iommu_group via
>> iommu_group_get_iommudata();
>> - spapr_tce_iommu_group points to either (depending on .get_table()):
>> 	- iommu_table;
>> 	- pnv_ioda_pe;
>>
>> This uses pnv_ioda1_iommu_get_table for both IODA1&2 but IODA2 will
>> have its own pnv_ioda2_iommu_get_table soon and pnv_ioda1_iommu_get_table
>> will only be used for IODA1.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/iommu.h            |   6 ++
>>  arch/powerpc/include/asm/tce.h              |  13 +++
>>  arch/powerpc/kernel/iommu.c                 |  35 ++++++-
>>  arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +++++-
>>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
>>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>>  arch/powerpc/platforms/pseries/iommu.c      |  10 +-
>>  drivers/vfio/vfio_iommu_spapr_tce.c         | 148 ++++++++++++++++++++++------
>>  8 files changed, 208 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 42632c7..84ee339 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>>   */
>>  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>  					    int nid);
>> +
>> +struct spapr_tce_iommu_ops;
>>  #ifdef CONFIG_IOMMU_API
>>  extern void iommu_register_group(struct iommu_table *tbl,
>> +				 void *iommu_owner,
>> +				 struct spapr_tce_iommu_ops *ops,
>>  				 int pci_domain_number, unsigned long pe_num);
>>  extern int iommu_add_device(struct device *dev);
>>  extern void iommu_del_device(struct device *dev);
>>  #else
>>  static inline void iommu_register_group(struct iommu_table *tbl,
>> +					void *iommu_owner,
>> +					struct spapr_tce_iommu_ops *ops,
>>  					int pci_domain_number,
>>  					unsigned long pe_num)
>>  {
>> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
>> index 743f36b..9f159eb 100644
>> --- a/arch/powerpc/include/asm/tce.h
>> +++ b/arch/powerpc/include/asm/tce.h
>> @@ -50,5 +50,18 @@
>>  #define TCE_PCI_READ		0x1		/* read from PCI allowed */
>>  #define TCE_VB_WRITE		0x1		/* write from VB allowed */
>>  
>> +struct spapr_tce_iommu_group;
>> +
>> +struct spapr_tce_iommu_ops {
>> +	struct iommu_table *(*get_table)(
>> +			struct spapr_tce_iommu_group *data,
>> +			int num);
>> +};
>> +
>> +struct spapr_tce_iommu_group {
>> +	void *iommu_owner;
>> +	struct spapr_tce_iommu_ops *ops;
>> +};
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _ASM_POWERPC_TCE_H */
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index b378f78..1c5dae7 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>>   */
>>  static void group_release(void *iommu_data)
>>  {
>> -	struct iommu_table *tbl = iommu_data;
>> -	tbl->it_group = NULL;
>> +	kfree(iommu_data);
>>  }
>>  
>> +static struct iommu_table *spapr_tce_default_get_table(
>> +		struct spapr_tce_iommu_group *data, int num)
>> +{
>> +	struct iommu_table *tbl = data->iommu_owner;
>> +
>> +	switch (num) {
>> +	case 0:
>> +		if (tbl->it_size)
>> +			return tbl;
>> +		/* fallthru */
>> +	default:
>> +		return NULL;
>> +	}
>> +}
>> +
>> +static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
>> +	.get_table = spapr_tce_default_get_table
>> +};
>> +
>>  void iommu_register_group(struct iommu_table *tbl,
>> +		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
>>  		int pci_domain_number, unsigned long pe_num)
>>  {
>>  	struct iommu_group *grp;
>>  	char *name;
>> +	struct spapr_tce_iommu_group *data;
>> +
>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>> +	if (!data)
>> +		return;
>> +
>> +	data->iommu_owner = iommu_owner ? iommu_owner : tbl;
>> +	data->ops = ops ? ops : &spapr_tce_default_ops;
>>  
>>  	grp = iommu_group_alloc();
>>  	if (IS_ERR(grp)) {
>>  		pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
>>  				PTR_ERR(grp));
>> +		kfree(data);
>>  		return;
>>  	}
>> +
>>  	tbl->it_group = grp;
>> -	iommu_group_set_iommudata(grp, tbl, group_release);
>> +	iommu_group_set_iommudata(grp, data, group_release);
>>  	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
>>  			pci_domain_number, pe_num);
>>  	if (!name)
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 136e765..2d32a1c 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/io.h>
>>  #include <linux/msi.h>
>>  #include <linux/memblock.h>
>> +#include <linux/iommu.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/io.h>
>> @@ -988,6 +989,26 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
>>  	}
>>  }
>>  
>> +static struct iommu_table *pnv_ioda1_iommu_get_table(
>> +		struct spapr_tce_iommu_group *data,
>> +		int num)
>> +{
>> +	struct pnv_ioda_pe *pe = data->iommu_owner;
>> +
>> +	switch (num) {
>> +	case 0:
>> +		if (pe->tce32.table.it_size)
>> +			return &pe->tce32.table;
>> +		/* fallthru */
>> +	default:
>> +		return NULL;
>> +	}
>> +}
>> +
>> +static struct spapr_tce_iommu_ops pnv_pci_ioda1_ops = {
>> +	.get_table = pnv_ioda1_iommu_get_table,
>> +};
>> +
>>  static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>  				      struct pnv_ioda_pe *pe, unsigned int base,
>>  				      unsigned int segs)
>> @@ -1067,7 +1088,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>  				 TCE_PCI_SWINV_PAIR);
>>  	}
>>  	iommu_init_table(tbl, phb->hose->node);
>> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>> +	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
>> +			phb->hose->global_number, pe->pe_number);
>>  
>>  	if (pe->pdev)
>>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
>> @@ -1137,6 +1159,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>  	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
>>  }
>>  
>> +static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
>> +	.get_table = pnv_ioda1_iommu_get_table,
>> +};
>> +
>>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>  				       struct pnv_ioda_pe *pe)
>>  {
>> @@ -1202,7 +1228,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>>  		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
>>  	}
>>  	iommu_init_table(tbl, phb->hose->node);
>> -	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
>> +	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
>> +			phb->hose->global_number, pe->pe_number);
>>  
>>  	if (pe->pdev)
>>  		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
>> diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> index 94ce348..b79066d 100644
>> --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> +++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
>> @@ -89,6 +89,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
>>  	if (phb->p5ioc2.iommu_table.it_map == NULL) {
>>  		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
>>  		iommu_register_group(&phb->p5ioc2.iommu_table,
>> +				NULL, NULL,
>>  				pci_domain_nr(phb->hose->bus), phb->opal_id);
>>  	}
>>  
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index 97895d4..6ffac79 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -723,7 +723,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
>>  	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
>>  				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
>>  	iommu_init_table(tbl, hose->node);
>> -	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
>> +	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
>>  
>>  	/* Deal with SW invalidated TCEs when needed (BML way) */
>>  	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 4642d6a..b95f8cf 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -616,7 +616,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
>>  
>>  	iommu_table_setparms(pci->phb, dn, tbl);
>>  	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
>> -	iommu_register_group(tbl, pci_domain_nr(bus), 0);
>> +	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
>>  
>>  	/* Divide the rest (1.75GB) among the children */
>>  	pci->phb->dma_window_size = 0x80000000ul;
>> @@ -661,7 +661,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
>>  				   ppci->phb->node);
>>  		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
>>  		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
>> -		iommu_register_group(tbl, pci_domain_nr(bus), 0);
>> +		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
>>  		pr_debug("  created table: %p\n", ppci->iommu_table);
>>  	}
>>  }
>> @@ -688,7 +688,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
>>  				   phb->node);
>>  		iommu_table_setparms(phb, dn, tbl);
>>  		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
>> -		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
>> +		iommu_register_group(tbl, NULL, NULL,
>> +				pci_domain_nr(phb->bus), 0);
>>  		set_iommu_table_base_and_group(&dev->dev,
>>  					       PCI_DN(dn)->iommu_table);
>>  		return;
>> @@ -1105,7 +1106,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
>>  				   pci->phb->node);
>>  		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
>>  		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
>> -		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
>> +		iommu_register_group(tbl, NULL, NULL,
>> +				pci_domain_nr(pci->phb->bus), 0);
>>  		pr_debug("  created table: %p\n", pci->iommu_table);
>>  	} else {
>>  		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 730b4ef..a8adfbd 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -43,17 +43,51 @@ static void tce_iommu_detach_group(void *iommu_data,
>>   */
>>  struct tce_container {
>>  	struct mutex lock;
>> -	struct iommu_table *tbl;
>> +	struct iommu_group *grp;
>> +	long windows_num;
>>  	bool enabled;
>>  };
>>  
>> +static struct iommu_table *spapr_tce_find_table(
>> +		struct tce_container *container,
>> +		struct spapr_tce_iommu_group *data,
>> +		phys_addr_t ioba)
>> +{
>> +	long i;
>> +	struct iommu_table *ret = NULL;
>> +
>> +	mutex_lock(&container->lock);
>> +	for (i = 0; i < container->windows_num; ++i) {
>> +		struct iommu_table *tbl = data->ops->get_table(data, i);
>> +
>> +		if (tbl) {
>> +			unsigned long entry = ioba >> tbl->it_page_shift;
>> +			unsigned long start = tbl->it_offset;
>> +			unsigned long end = start + tbl->it_size;
>> +
>> +			if ((start <= entry) && (entry < end)) {
>> +				ret = tbl;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	mutex_unlock(&container->lock);
>> +
>> +	return ret;
>> +}
>> +
>>  static int tce_iommu_enable(struct tce_container *container)
>>  {
>>  	int ret = 0;
>>  	unsigned long locked, lock_limit, npages;
>> -	struct iommu_table *tbl = container->tbl;
>> +	struct iommu_table *tbl;
>> +	struct spapr_tce_iommu_group *data;
>>  
>> -	if (!container->tbl)
>> +	if (!container->grp)
>> +		return -ENXIO;
>> +
>> +	data = iommu_group_get_iommudata(container->grp);
>> +	if (!data || !data->iommu_owner || !data->ops->get_table)
>>  		return -ENXIO;
>>  
>>  	if (!current->mm)
>> @@ -80,6 +114,10 @@ static int tce_iommu_enable(struct tce_container *container)
>>  	 * that would effectively kill the guest at random points, much better
>>  	 * enforcing the limit based on the max that the guest can map.
>>  	 */
>> +	tbl = data->ops->get_table(data, 0);
> 
> Can we define an enum to avoid sprinkling magic zeros in the code?

Right. Missed it in one of the iterations :-/


>> +	if (!tbl)
>> +		return -ENXIO;
>> +
>>  	down_write(&current->mm->mmap_sem);
>>  	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>>  	locked = current->mm->locked_vm + npages;
>> @@ -89,7 +127,6 @@ static int tce_iommu_enable(struct tce_container *container)
>>  				rlimit(RLIMIT_MEMLOCK));
>>  		ret = -ENOMEM;
>>  	} else {
>> -
>>  		current->mm->locked_vm += npages;
>>  		container->enabled = true;
>>  	}
>> @@ -100,16 +137,27 @@ static int tce_iommu_enable(struct tce_container *container)
>>  
>>  static void tce_iommu_disable(struct tce_container *container)
>>  {
>> +	struct spapr_tce_iommu_group *data;
>> +	struct iommu_table *tbl;
>> +
>>  	if (!container->enabled)
>>  		return;
>>  
>>  	container->enabled = false;
>>  
>> -	if (!container->tbl || !current->mm)
>> +	if (!container->grp || !current->mm)
>> +		return;
>> +
>> +	data = iommu_group_get_iommudata(container->grp);
>> +	if (!data || !data->iommu_owner || !data->ops->get_table)
>> +		return;
>> +
>> +	tbl = data->ops->get_table(data, 0);
>> +	if (!tbl)
>>  		return;
>>  
>>  	down_write(&current->mm->mmap_sem);
>> -	current->mm->locked_vm -= (container->tbl->it_size <<
>> +	current->mm->locked_vm -= (tbl->it_size <<
>>  			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>>  	up_write(&current->mm->mmap_sem);
>>  }
>> @@ -129,6 +177,8 @@ static void *tce_iommu_open(unsigned long arg)
>>  
>>  	mutex_init(&container->lock);
>>  
>> +	container->windows_num = 1;
>> +
>>  	return container;
>>  }
>>  
>> @@ -136,11 +186,11 @@ static void tce_iommu_release(void *iommu_data)
>>  {
>>  	struct tce_container *container = iommu_data;
>>  
>> -	WARN_ON(container->tbl && !container->tbl->it_group);
>> +	WARN_ON(container->grp);
>>  	tce_iommu_disable(container);
>>  
>> -	if (container->tbl && container->tbl->it_group)
>> -		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
>> +	if (container->grp)
>> +		tce_iommu_detach_group(iommu_data, container->grp);
>>  
>>  	mutex_destroy(&container->lock);
>>  
>> @@ -169,8 +219,17 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  
>>  	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>>  		struct vfio_iommu_spapr_tce_info info;
>> -		struct iommu_table *tbl = container->tbl;
>> +		struct iommu_table *tbl;
>> +		struct spapr_tce_iommu_group *data;
>>  
>> +		if (WARN_ON(!container->grp))
>> +			return -ENXIO;
>> +
>> +		data = iommu_group_get_iommudata(container->grp);
>> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
>> +			return -ENXIO;
>> +
>> +		tbl = data->ops->get_table(data, 0);
>>  		if (WARN_ON(!tbl))
>>  			return -ENXIO;
>>  
>> @@ -194,13 +253,16 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  	}
>>  	case VFIO_IOMMU_MAP_DMA: {
>>  		struct vfio_iommu_type1_dma_map param;
>> -		struct iommu_table *tbl = container->tbl;
>> +		struct iommu_table *tbl;
>> +		struct spapr_tce_iommu_group *data;
>>  		unsigned long tce, i;
>>  
>> -		if (!tbl)
>> +		if (WARN_ON(!container->grp))
> 
> If a user can get here by their own mistake, return an error and be
> done, no warning.  If a user can get here via a kernel ordering issue,
> it's a problem in the design.  Which is it?


I'll remove these WARNs here and below. I just cannot force myself to
start thinking of a malicious userspace which can trigger many of those :)


> 
>>  			return -ENXIO;
>>  
>> -		BUG_ON(!tbl->it_group);
>> +		data = iommu_group_get_iommudata(container->grp);
>> +		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
>> +			return -ENXIO;
> 
> Same
> 
>>  
>>  		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>  
>> @@ -225,6 +287,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
>>  			tce |= TCE_PCI_WRITE;
>>  
>> +		tbl = spapr_tce_find_table(container, data, param.iova);
> 
> Why aren't we using ops->find_table() here?


I am trying to keep spapr_tce_iommu_ops as simple as possible, so for now
it only provides a search-by-window-number callback; the helper that
searches by IOBA is simply built on top of it.
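Schematically (just the layering, not the exact code):

	/* the only ops primitive: look up a window by its number */
	tbl = data->ops->get_table(data, num);

	/* the VFIO driver builds the search by IOBA on top of that */
	tbl = spapr_tce_find_table(container, data, param.iova);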




-- 
Alexey

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
  2014-09-23 21:56   ` Alex Williamson
@ 2014-10-10 18:33     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2014-10-10 18:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Gavin Shan, linux-kernel, kvm

On 09/23/2014 11:56 PM, Alex Williamson wrote:
> On Tue, 2014-09-23 at 13:01 +1000, Alexey Kardashevskiy wrote:
>> This defines and implements a VFIO IOMMU API which lets userspace
>> create and remove DMA windows.
>>
>> This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
>> available windows and page mask.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
>> to allow userspace to create and remove window(s).
>>
>> The VFIO IOMMU driver does basic sanity checks and calls corresponding
>> SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
>> implements them.
>>
>> This advertises the VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO.
>>
>> This calls the platform DDW reset() callback when the IOMMU is being disabled
>> to reset the DMA configuration to its original state.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++++++++++++++++++++++++++++++++++--
>>  include/uapi/linux/vfio.h           |  25 ++++++-
>>  2 files changed, 153 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 0dccbc4..b518891 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
>>  
>>  	container->enabled = false;
>>  
>> -	if (!container->grp || !current->mm)
>> +	if (!container->grp)
>>  		return;
>>  
>>  	data = iommu_group_get_iommudata(container->grp);
>>  	if (!data || !data->iommu_owner || !data->ops->get_table)
>>  		return;
>>  
>> -	tbl = data->ops->get_table(data, 0);
>> -	if (!tbl)
>> -		return;
>> +	if (current->mm) {
>> +		tbl = data->ops->get_table(data, 0);
>> +		if (tbl)
>> +			decrement_locked_vm(tbl);
>>  
>> -	decrement_locked_vm(tbl);
>> +		tbl = data->ops->get_table(data, 1);
>> +		if (tbl)
>> +			decrement_locked_vm(tbl);
>> +	}
>> +
>> +	if (data->ops->reset)
>> +		data->ops->reset(data);
>>  }
>>  
>>  static void *tce_iommu_open(unsigned long arg)
>> @@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  				 unsigned int cmd, unsigned long arg)
>>  {
>>  	struct tce_container *container = iommu_data;
>> -	unsigned long minsz;
>> +	unsigned long minsz, ddwsz;
>>  	long ret;
>>  
>>  	switch (cmd) {
>> @@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
>>  		info.flags = 0;
>>  
>> +		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
>> +				page_size_mask);
>> +
>> +		if (info.argsz == ddwsz) {
> 
> >=
> 
>> +			if (data->ops->query && data->ops->create &&
>> +					data->ops->remove) {
>> +				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
> 
> I think you want to set this flag regardless of whether the user has
> provided space for it.  A valid use model is to call with the minimum
> size and look at the flags to determine if it needs to be called again
> with a larger size.
> 
>> +
>> +				ret = data->ops->query(data,
>> +						&info.current_windows,
>> +						&info.windows_available,
>> +						&info.page_size_mask);
>> +				if (ret)
>> +					return ret;
>> +			} else {
>> +				info.current_windows = 0;
>> +				info.windows_available = 0;
>> +				info.page_size_mask = 0;
>> +			}
>> +			minsz = ddwsz;
> 
> It's not really any longer the min size, is it?
> 
>> +		}
>> +
>>  		if (copy_to_user((void __user *)arg, &info, minsz))
>>  			return -EFAULT;
>>  
>> @@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		tce_iommu_disable(container);
>>  		mutex_unlock(&container->lock);
>>  		return 0;
>> +
>>  	case VFIO_EEH_PE_OP:
>>  		if (!container->grp)
>>  			return -ENODEV;
>>  
>>  		return vfio_spapr_iommu_eeh_ioctl(container->grp,
>>  						  cmd, arg);
>> +
>> +	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
>> +		struct vfio_iommu_spapr_tce_create create;
>> +		struct spapr_tce_iommu_group *data;
>> +		struct iommu_table *tbl;
>> +
>> +		if (WARN_ON(!container->grp))
> 
> Same as my previous comment on this WARN_ON.
> 
>> +			return -ENXIO;
>> +
>> +		data = iommu_group_get_iommudata(container->grp);
>> +
>> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
>> +				start_addr);
>> +
>> +		if (copy_from_user(&create, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (create.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		if (create.flags)
>> +			return -EINVAL;
>> +
>> +		if (!data->ops->create || !data->iommu_owner)
>> +			return -ENOSYS;
>> +
>> +		BUG_ON(!data || !data->ops || !data->ops->remove);
> 
> A little late for this test, since we'll already have oopsed on the
> previous test. And why is this a BUG_ON? A user could exploit this on a
> system with only a partial set of callbacks.
> 
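
FWIW, a sketch of the reordered checks, validating everything up front
and returning an error instead of BUG_ON (same names as in the patch):

	data = iommu_group_get_iommudata(container->grp);
	if (!data || !data->ops || !data->ops->create ||
			!data->ops->remove || !data->iommu_owner)
		return -ENOSYS;
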
>> +
>> +		ret = data->ops->create(data, create.page_shift,
>> +				create.window_shift, &tbl);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = try_increment_locked_vm(tbl);
>> +		if (ret) {
>> +			data->ops->remove(data, tbl);
>> +			return ret;
>> +		}
>> +
>> +		create.start_addr = tbl->it_offset << tbl->it_page_shift;
>> +
>> +		if (copy_to_user((void __user *)arg, &create, minsz)) {
>> +			data->ops->remove(data, tbl);
>> +			decrement_locked_vm(tbl);
>> +			return -EFAULT;
>> +		}
>> +		mutex_lock(&container->lock);
>> +		++container->windows_num;
>> +		mutex_unlock(&container->lock);
>> +
>> +		return ret;
>> +	}
>> +	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
>> +		struct vfio_iommu_spapr_tce_remove remove;
>> +		struct spapr_tce_iommu_group *data;
>> +		struct iommu_table *tbl;
>> +
>> +		if (WARN_ON(!container->grp))
>> +			return -ENXIO;
>> +
>> +		data = iommu_group_get_iommudata(container->grp);
>> +
>> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
>> +				start_addr);
>> +
>> +		if (copy_from_user(&remove, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (remove.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		if (remove.flags)
>> +			return -EINVAL;
>> +
>> +		if (!data->ops->remove || !data->iommu_owner)
> 
> On this one we don't bother to check data/data->ops. Is there also an
> exploit where the user can call these CREATE/REMOVE interfaces, even
> though INFO doesn't expose them, if only a partial set of callbacks is
> present?

		if (!data || !data->ops || !data->ops->remove || !data->iommu_owner)

should do it, right?
And I am not going to add create() without remove(); maybe it is worth
adding a compile-time check for that.
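
A sketch of what such a check could look like at registration time
rather than compile time — note spapr_tce_set_ops() here is a
hypothetical helper, not a function from this series:

	/* Hypothetical: refuse ops that can create but not remove windows */
	static int spapr_tce_set_ops(struct spapr_tce_iommu_group *data,
			struct spapr_tce_iommu_ops *ops)
	{
		if (WARN_ON(ops->create && !ops->remove))
			return -EINVAL;

		data->ops = ops;
		return 0;
	}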


> 
>> +			return -ENOSYS;
>> +
>> +		tbl = spapr_tce_find_table(container, data, remove.start_addr);
> 
> What happens if this returns the 0-index table rather than the expected
> 1-index table?  Why doesn't this call ops->find_table()?

Why ops->find_table()? They are different: ->find_table() searches for
the window by number, while spapr_tce_find_table() searches by address.
I do not understand this comment.

And removing window #0 is supported.
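
To make the by-address semantics concrete, spapr_tce_find_table() could
plausibly be structured like this (a sketch; the window count and the
loop body are assumptions, not taken from the actual helper):

	static struct iommu_table *spapr_tce_find_table(
			struct tce_container *container,
			struct spapr_tce_iommu_group *data,
			__u64 start_addr)
	{
		int num;

		/* Walk both possible windows and match on the bus address */
		for (num = 0; num < 2; ++num) {
			struct iommu_table *tbl = data->ops->get_table(data, num);

			if (tbl && start_addr ==
					(tbl->it_offset << tbl->it_page_shift))
				return tbl;
		}

		return NULL;
	}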


> 
>> +		if (!tbl)
>> +			return -EINVAL;
>> +
>> +		ret = data->ops->remove(data, tbl);
>> +		if (ret)
>> +			return ret;
>> +
>> +		decrement_locked_vm(tbl);
>> +
>> +		mutex_lock(&container->lock);
>> +		--container->windows_num;
>> +		mutex_unlock(&container->lock);
>> +
>> +		return 0;
>> +	}
>>  	}
>>  
>>  	return -ENOTTY;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 6612974..e71a6ef 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -451,9 +451,13 @@ struct vfio_iommu_type1_dma_unmap {
>>   */
>>  struct vfio_iommu_spapr_tce_info {
>>  	__u32 argsz;
>> -	__u32 flags;			/* reserved for future use */
>> +	__u32 flags;
>> +#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
>>  	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
>>  	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
>> +	__u32 current_windows;
>> +	__u32 windows_available;
>> +	__u32 page_size_mask;
>>  };
>>  
>>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
>> @@ -489,6 +493,25 @@ struct vfio_eeh_pe_op {
>>  
>>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>>  
>> +struct vfio_iommu_spapr_tce_create {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +	/* in */
>> +	__u32 page_shift;
>> +	__u32 window_shift;
>> +	/* out */
>> +	__u64 start_addr;
>> +};
>> +#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
>> +
>> +struct vfio_iommu_spapr_tce_remove {
>> +	__u32 argsz;
>> +	__u32 flags;
>> +	/* in */
>> +	__u64 start_addr;
>> +};
>> +#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
>> +
> 
> Zero comments, no good.

Right. I'll fix it. Thanks for the review.
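
For reference, the kind of per-field comments being asked for might look
like this (wording illustrative, not taken from the actual follow-up):

	struct vfio_iommu_spapr_tce_create {
		__u32 argsz;		/* size of this structure */
		__u32 flags;		/* reserved, must be zero */
		/* in */
		__u32 page_shift;	/* log2 of the TCE page size, e.g. 16 for 64K */
		__u32 window_shift;	/* log2 of the window size in bytes */
		/* out */
		__u64 start_addr;	/* bus address at which the new window starts */
	};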


> 
>>  /* ***************************************************************** */
>>  
>>  #endif /* _UAPIVFIO_H */
> 
> 
> 


-- 
Alexey
