* [PATCH v3 00/24] powerpc/iommu/vfio: Enable Dynamic DMA windows
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel


This enables the PAPR-defined feature called Dynamic DMA Windows (DDW).

Each Partitionable Endpoint (IOMMU group) has a separate DMA window on
a PCI bus where devices are allowed to perform DMA. By default a 1GB or
2GB window is allocated at host boot time, and these windows are used
when an IOMMU group is passed to userspace (the guest). These windows
are mapped at zero offset on the PCI bus.

High-speed devices may suffer from the limited size of this window. On the
host side, a TCE bypass mode is enabled on the POWER8 CPU which implements
direct mapping of host memory to the PCI bus at offset 1<<59.

For the guest, PAPR defines a DDW RTAS API which allows the pseries guest
to query the hypervisor whether it supports DDW and what the parameters
of possible windows are.

Currently POWER8 supports 2 DMA windows per PE: the already mentioned
small 32-bit window, and a 64-bit window which can only start at 1<<59
and supports various page sizes.
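The two addressing schemes described above can be sketched in plain C. This is an illustrative user-space model only, not kernel code; the constants and function names are assumptions mirroring the text (a 2GB default window at bus offset 0, and direct mapping at 1<<59):

```c
#include <assert.h>
#include <stdint.h>

/* Two DMA windows per PE on POWER8, as described above (illustrative model). */
#define DMA32_WINDOW_BASE   0ULL          /* small window, mapped at bus offset 0 */
#define DMA32_WINDOW_SIZE   (2ULL << 30)  /* 1GB or 2GB, allocated at host boot */
#define TCE_BYPASS_BASE     (1ULL << 59)  /* 64-bit window/bypass starts here */

/* In bypass mode, host memory is direct-mapped: bus = base + host physical. */
static uint64_t bypass_bus_addr(uint64_t host_phys)
{
	return TCE_BYPASS_BASE + host_phys;
}

/* A device DMA address is translated by the 32-bit window iff it falls
 * inside that window's range. */
static int in_dma32_window(uint64_t bus_addr)
{
	return bus_addr < DMA32_WINDOW_BASE + DMA32_WINDOW_SIZE;
}
```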

This patchset reworks the PPC IOMMU code and adds the structures
necessary to support big windows.

When the guest detects the feature and the PE is capable of 64-bit DMA,
it:
1. queries the hypervisor for the number of available windows and supported
page masks;
2. creates a window with the biggest possible page size (current guests can
use 64K or 16MB TCEs);
3. maps the entire guest RAM via H_PUT_TCE* hypercalls;
4. switches dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore and the guest gets
maximum performance.
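The payoff of step 2 (choosing the biggest TCE page size) is easy to quantify: mapping all of guest RAM takes one hypercall-driven TCE update per page, so bigger pages mean far fewer H_PUT_TCE calls. A minimal model of that arithmetic (names are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Number of TCE entries (and thus per-entry hypercalls) needed to map
 * ram_bytes of guest RAM with a TCE page of size 1 << page_shift. */
static uint64_t tce_count(uint64_t ram_bytes, unsigned page_shift)
{
	uint64_t page = 1ULL << page_shift;

	return (ram_bytes + page - 1) / page;   /* round up */
}
```

For 1GB of guest RAM, 4K TCEs need 262144 updates while 16MB TCEs need only 64, which is why the guest prefers the biggest supported page size.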

Changes:
v3:
* (!) redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest ->
no problems with locked_vm counting; also we save memory on actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests

v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise




Alexey Kardashevskiy (24):
  vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
    driver
  vfio: powerpc/iommu: Check that TCE page size is equal to it_page_size
  powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/iommu: Introduce iommu_table_alloc() helper
  powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/powernv/ioda2: Rework IOMMU ownership control
  powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
  powerpc/iommu/powernv: Release replaced TCE
  powerpc/pseries/lpar: Enable VFIO
  vfio: powerpc/spapr: Register memory
  powerpc/powernv/ioda2: Rework iommu_table creation
  powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_create_table
  powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
  powerpc/iommu: Split iommu_free_table into 2 helpers
  powerpc/powernv: Implement multilevel TCE tables
  powerpc/powernv: Change prototypes to receive iommu
  powerpc/powernv/ioda: Define and implement DMA table/window management
    callbacks
  powerpc/iommu: Get rid of ownership helpers
  vfio/spapr: Enable multiple groups in a container
  vfio: powerpc/spapr: Support Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h            | 107 +++-
 arch/powerpc/include/asm/machdep.h          |  25 -
 arch/powerpc/kernel/eeh.c                   |   2 +-
 arch/powerpc/kernel/iommu.c                 | 282 +++------
 arch/powerpc/kernel/vio.c                   |   5 +
 arch/powerpc/platforms/cell/iommu.c         |   8 +-
 arch/powerpc/platforms/pasemi/iommu.c       |   7 +-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 470 ++++++++++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  21 +-
 arch/powerpc/platforms/powernv/pci.c        | 130 +++--
 arch/powerpc/platforms/powernv/pci.h        |  14 +-
 arch/powerpc/platforms/pseries/iommu.c      |  99 +++-
 arch/powerpc/sysdev/dart_iommu.c            |  12 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 874 ++++++++++++++++++++++++----
 include/uapi/linux/vfio.h                   |  53 +-
 15 files changed, 1584 insertions(+), 525 deletions(-)

-- 
2.0.0




* [PATCH v3 01/24] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This moves the page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and into the VFIO IOMMU driver where it belongs,
as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage, and calling put_page() on garbage is
undefined behaviour.

Besides the last part, the rest of the patch is mechanical.
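The error-unwind pattern used by the new tce_iommu_build() in the diff below (on failure at page i, clear exactly the i entries already built) can be modelled in user space. This is a sketch of the control flow only; pinning is simulated by a counter and all names are illustrative:

```c
#include <assert.h>

#define NPAGES 8

static int pinned;          /* stands in for get_user_pages_fast()'s effect */
static int table[NPAGES];   /* stands in for the TCE table */

static int build_one(int entry, int fail_at)
{
	if (entry == fail_at)
		return -1;  /* simulate a pin or build failure */
	pinned++;
	table[entry] = 1;
	return 0;
}

static void clear_range(int entry, int n)
{
	for (int i = 0; i < n; i++, entry++) {
		if (table[entry]) {
			table[entry] = 0;
			pinned--;   /* put_page() equivalent */
		}
	}
}

/* Mirrors tce_iommu_build(): map `pages` entries starting at `entry`;
 * on failure at page i, undo only the first i pages. */
static int build_range(int entry, int pages, int fail_at)
{
	int i, ret = 0;

	for (i = 0; i < pages; i++) {
		ret = build_one(entry + i, fail_at);
		if (ret)
			break;
	}
	if (ret)
		clear_range(entry, i);
	return ret;
}
```

The invariant the pattern preserves is that every successfully pinned page is eventually unpinned exactly once, whether the build succeeds or fails partway.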

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |  6 ---
 arch/powerpc/kernel/iommu.c         | 68 ---------------------------
 drivers/vfio/vfio_iommu_spapr_tce.c | 91 +++++++++++++++++++++++++++++++------
 3 files changed, 78 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa370..45b07f6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -191,16 +191,10 @@ extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction);
 extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
 		unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-		unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
-		unsigned long entry, unsigned long tce);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5d3968c..456acb1 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,19 +903,6 @@ void iommu_register_group(struct iommu_table *tbl,
 	kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
-{
-	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
-		return DMA_BIDIRECTIONAL;
-	else if (tce & TCE_PCI_READ)
-		return DMA_TO_DEVICE;
-	else if (tce & TCE_PCI_WRITE)
-		return DMA_FROM_DEVICE;
-	else
-		return DMA_NONE;
-}
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
-
 void iommu_flush_tce(struct iommu_table *tbl)
 {
 	/* Flush/invalidate TLB caches if necessary */
@@ -991,30 +978,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce);
 
-int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-		unsigned long entry, unsigned long pages)
-{
-	unsigned long oldtce;
-	struct page *page;
-
-	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
-			continue;
-
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-
 /*
  * hwaddr is a kernel virtual address here (0xc... bazillion),
  * tce_build converts it to a physical address.
@@ -1044,35 +1007,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_build);
 
-int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
-		unsigned long tce)
-{
-	int ret;
-	struct page *page = NULL;
-	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
-	enum dma_data_direction direction = iommu_tce_direction(tce);
-
-	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-			direction != DMA_TO_DEVICE, &page);
-	if (unlikely(ret != 1)) {
-		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << tbl->it_page_shift, ret); */
-		return -EFAULT;
-	}
-	hwaddr = (unsigned long) page_address(page) + offset;
-
-	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
-	if (ret)
-		put_page(page);
-
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
-			__func__, entry << tbl->it_page_shift, tce, ret);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-
 int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1086,7 +1020,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	}
 
 	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
 	/*
 	 * Disable iommu bypass, otherwise the user can DMA to all of
@@ -1104,7 +1037,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
 
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 730b4ef..dc4a886 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -147,6 +147,78 @@ static void tce_iommu_release(void *iommu_data)
 	kfree(container);
 }
 
+static int tce_iommu_clear(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages)
+{
+	unsigned long oldtce;
+	struct page *page;
+
+	for ( ; pages; --pages, ++entry) {
+		oldtce = iommu_clear_tce(tbl, entry);
+		if (!oldtce)
+			continue;
+
+		page = pfn_to_page(oldtce >> PAGE_SHIFT);
+		WARN_ON(!page);
+		if (page) {
+			if (oldtce & TCE_PCI_WRITE)
+				SetPageDirty(page);
+			put_page(page);
+		}
+	}
+
+	return 0;
+}
+
+static enum dma_data_direction tce_iommu_direction(unsigned long tce)
+{
+	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
+		return DMA_BIDIRECTIONAL;
+	else if (tce & TCE_PCI_READ)
+		return DMA_TO_DEVICE;
+	else if (tce & TCE_PCI_WRITE)
+		return DMA_FROM_DEVICE;
+	else
+		return DMA_NONE;
+}
+
+static long tce_iommu_build(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce, unsigned long pages)
+{
+	long i, ret = 0;
+	struct page *page = NULL;
+	unsigned long hva;
+	enum dma_data_direction direction = tce_iommu_direction(tce);
+
+	for (i = 0; i < pages; ++i) {
+		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+				direction != DMA_TO_DEVICE, &page);
+		if (unlikely(ret != 1)) {
+			ret = -EFAULT;
+			break;
+		}
+		hva = (unsigned long) page_address(page) +
+			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
+
+		ret = iommu_tce_build(tbl, entry + i, hva, direction);
+		if (ret) {
+			put_page(page);
+			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+					__func__, entry << tbl->it_page_shift,
+					tce, ret);
+			break;
+		}
+		tce += IOMMU_PAGE_SIZE_4K;
+	}
+
+	if (ret)
+		tce_iommu_clear(container, tbl, entry, i);
+
+	return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
@@ -195,7 +267,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
 		struct iommu_table *tbl = container->tbl;
-		unsigned long tce, i;
+		unsigned long tce;
 
 		if (!tbl)
 			return -ENXIO;
@@ -229,17 +301,9 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
-		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
-			ret = iommu_put_tce_user_mode(tbl,
-					(param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
-					tce);
-			if (ret)
-				break;
-			tce += IOMMU_PAGE_SIZE_4K;
-		}
-		if (ret)
-			iommu_clear_tces_and_put_pages(tbl,
-					param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+		ret = tce_iommu_build(container, tbl,
+				param.iova >> IOMMU_PAGE_SHIFT_4K,
+				tce, param.size >> IOMMU_PAGE_SHIFT_4K);
 
 		iommu_flush_tce(tbl);
 
@@ -273,7 +337,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
-		ret = iommu_clear_tces_and_put_pages(tbl,
+		ret = tce_iommu_clear(container, tbl,
 				param.iova >> IOMMU_PAGE_SHIFT_4K,
 				param.size >> IOMMU_PAGE_SHIFT_4K);
 		iommu_flush_tce(tbl);
@@ -357,6 +421,7 @@ static void tce_iommu_detach_group(void *iommu_data,
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
 		container->tbl = NULL;
+		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
 		iommu_release_ownership(tbl);
 	}
 	mutex_unlock(&container->lock);
-- 
2.0.0



* [PATCH v3 02/24] vfio: powerpc/iommu: Check that TCE page size is equal to it_page_size
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This checks that the TCE table page size is not bigger than the size of
the page we just pinned and whose physical address we are about to put
into the table.

Otherwise the hardware gets unwanted access to the physical memory between
the end of the actual page and the end of the aligned-up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check on whether the page is huge.
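The check added below reduces to shift arithmetic: a (possibly compound) pinned page covers 1 << (PAGE_SHIFT + compound_order) bytes, and that must be at least the TCE page size. A user-space model of the comparison, assuming a 4K base page (the shift value is an assumption of this model, not taken from the patch):

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12   /* 4K base pages, an assumption of this model */

/* Returns 1 if a compound page of the given order is big enough to back
 * one TCE entry of size 1 << tce_page_shift, mirroring tce_check_page_size(). */
static int page_big_enough(unsigned compound_order, unsigned tce_page_shift)
{
	return MODEL_PAGE_SHIFT + compound_order >= tce_page_shift;
}
```

So a plain 4K page may back a 4K TCE but not a 64K one, while a 16MB huge page (order 12 over a 4K base) backs 64K and 16MB TCEs; the order-0 case falls out of the same comparison, which is why no separate huge-page check is needed.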

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* check is done for all page sizes now, not just for huge pages
* failed check returns EFAULT now (was EINVAL)
* moved the check to VFIO SPAPR IOMMU driver
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index dc4a886..99b98fa 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,22 @@ struct tce_container {
 	bool enabled;
 };
 
+static bool tce_check_page_size(struct page *page, unsigned page_shift)
+{
+	unsigned shift;
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	shift = PAGE_SHIFT + compound_order(compound_head(page));
+	if (shift >= page_shift)
+		return true;
+
+	return false;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -199,6 +215,12 @@ static long tce_iommu_build(struct tce_container *container,
 			ret = -EFAULT;
 			break;
 		}
+
+		if (!tce_check_page_size(page, tbl->it_page_shift)) {
+			ret = -EFAULT;
+			break;
+		}
+
 		hva = (unsigned long) page_address(page) +
 			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 02/24] vfio: powerpc/iommu: Check that TCE page size is equal to it_page_size
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* check is done for all page sizes now, not just for huge pages
* failed check returns EFAULT now (was EINVAL)
* moved the check to VFIO SPAPR IOMMU driver
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index dc4a886..99b98fa 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,22 @@ struct tce_container {
 	bool enabled;
 };
 
+static bool tce_check_page_size(struct page *page, unsigned page_shift)
+{
+	unsigned shift;
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk that it should.
+	 */
+	shift = PAGE_SHIFT + compound_order(compound_head(page));
+	if (shift >= page_shift)
+		return true;
+
+	return false;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -199,6 +215,12 @@ static long tce_iommu_build(struct tce_container *container,
 			ret = -EFAULT;
 			break;
 		}
+
+		if (!tce_check_page_size(page, tbl->it_page_shift)) {
+			ret = -EFAULT;
+			break;
+		}
+
 		hva = (unsigned long) page_address(page) +
 			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
 
-- 
2.0.0


* [PATCH v3 03/24] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

Normally a bitmap from the iommu_table is used to track which TCE
entries are in use. Since we are going to use the iommu_table without
its locks and do xchg() instead, it becomes essential not to set bits
which are not implied by the direction flag.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/powernv/pci.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4945e87..9ec7d68 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -589,19 +589,27 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };
 
+static unsigned long pnv_dmadir_to_flags(enum dma_data_direction direction)
+{
+	switch (direction) {
+	case DMA_BIDIRECTIONAL:
+	case DMA_FROM_DEVICE:
+		return TCE_PCI_READ | TCE_PCI_WRITE;
+	case DMA_TO_DEVICE:
+		return TCE_PCI_READ;
+	default:
+		return 0;
+	}
+}
+
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
 {
-	u64 proto_tce;
+	u64 proto_tce = pnv_dmadir_to_flags(direction);
 	__be64 *tcep, *tces;
 	u64 rpn;
 
-	proto_tce = TCE_PCI_READ; // Read allowed
-
-	if (direction != DMA_TO_DEVICE)
-		proto_tce |= TCE_PCI_WRITE;
-
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
 	rpn = __pa(uaddr) >> tbl->it_page_shift;
 
-- 
2.0.0



* [PATCH v3 04/24] vfio: powerpc/spapr: Use it_page_size
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This makes use of it_page_size from the iommu_table struct, as the
page size can differ.

This also replaces the missing IOMMU_PAGE_SHIFT macro in commented-out
debug code, as the recently introduced IOMMU_PAGE_XXX macros do not
include IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 99b98fa..c596053 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -97,7 +97,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * enforcing the limit based on the max that the guest can map.
 	 */
 	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -126,7 +126,7 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	down_write(&current->mm->mmap_sem);
 	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+			container->tbl->it_page_shift) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
 
@@ -232,7 +232,7 @@ static long tce_iommu_build(struct tce_container *container,
 					tce, ret);
 			break;
 		}
-		tce += IOMMU_PAGE_SIZE_4K;
+		tce += IOMMU_PAGE_SIZE(tbl);
 	}
 
 	if (ret)
@@ -277,8 +277,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -308,8 +308,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;
 
-		if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-				(param.vaddr & ~IOMMU_PAGE_MASK_4K))
+		if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+				(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
 			return -EINVAL;
 
 		/* iova is checked by the IOMMU API */
@@ -324,8 +324,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 			return ret;
 
 		ret = tce_iommu_build(container, tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				tce, param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				tce, param.size >> tbl->it_page_shift);
 
 		iommu_flush_tce(tbl);
 
@@ -351,17 +351,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;
 
-		if (param.size & ~IOMMU_PAGE_MASK_4K)
+		if (param.size & ~IOMMU_PAGE_MASK(tbl))
 			return -EINVAL;
 
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.size >> tbl->it_page_shift);
 		if (ret)
 			return ret;
 
 		ret = tce_iommu_clear(container, tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				param.size >> tbl->it_page_shift);
 		iommu_flush_tce(tbl);
 
 		return ret;
-- 
2.0.0



* [PATCH v3 05/24] vfio: powerpc/spapr: Move locked_vm accounting to helpers
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This moves the locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might need to be bigger than the guest RAM. This also prints the pid
of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 72 +++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c596053..29d5708 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,47 @@
 static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
+#define IOMMU_TABLE_PAGES(tbl) \
+		(((tbl)->it_size << (tbl)->it_page_shift) >> PAGE_SHIFT)
+
+static long try_increment_locked_vm(long npages)
+{
+	long ret = 0, locked, lock_limit;
+
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("[%d] RLIMIT_MEMLOCK (%ld) exceeded\n",
+				current->pid, rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	pr_debug("[%d] RLIMIT_MEMLOCK+ %ld pages\n", current->pid,
+			current->mm->locked_vm);
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+static void decrement_locked_vm(long npages)
+{
+	if (!current || !current->mm)
+		return; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	if (npages > current->mm->locked_vm)
+		npages = current->mm->locked_vm;
+	current->mm->locked_vm -= npages;
+	pr_debug("[%d] RLIMIT_MEMLOCK- %ld pages\n", current->pid,
+			current->mm->locked_vm);
+	up_write(&current->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -66,8 +107,6 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked, lock_limit, npages;
-	struct iommu_table *tbl = container->tbl;
 
 	if (!container->tbl)
 		return -ENXIO;
@@ -95,21 +134,19 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
+	 *
+	 * Unfortunately at the moment it counts whole tables, no matter how
+	 * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+	 * each with 2GB DMA window, 8GB will be counted here. The reason for
+	 * this is that we cannot tell here the amount of RAM used by the guest
+	 * as this information is only available from KVM and VFIO is
+	 * KVM agnostic.
 	 */
-	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
-	locked = current->mm->locked_vm + npages;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-				rlimit(RLIMIT_MEMLOCK));
-		ret = -ENOMEM;
-	} else {
+	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
+	if (ret)
+		return ret;
 
-		current->mm->locked_vm += npages;
-		container->enabled = true;
-	}
-	up_write(&current->mm->mmap_sem);
+	container->enabled = true;
 
 	return ret;
 }
@@ -124,10 +161,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!container->tbl || !current->mm)
 		return;
 
-	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (container->tbl->it_size <<
-			container->tbl->it_page_shift) >> PAGE_SHIFT;
-	up_write(&current->mm->mmap_sem);
+	decrement_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0



* [PATCH v3 06/24] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/
tce_flush callbacks from ppc_md to the new struct where they really
belong.

This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though as there will be cases when iommu_init_table()
will not be called on TCE tables used by VFIO.

This does s/tce_build/set/ and s/tce_free/clear/ and removes the
redundant "tce_" prefixes.

This removes tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode.

For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back
to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is
not present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want
to walk through all IOMMU tables in the system and replace "multi"
callbacks with single ones.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 17 +++++++++++
 arch/powerpc/include/asm/machdep.h          | 25 ----------------
 arch/powerpc/kernel/iommu.c                 | 46 +++++++++++++++--------------
 arch/powerpc/kernel/vio.c                   |  5 ++++
 arch/powerpc/platforms/cell/iommu.c         |  8 +++--
 arch/powerpc/platforms/pasemi/iommu.c       |  7 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  2 ++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  1 +
 arch/powerpc/platforms/powernv/pci.c        | 23 ++++-----------
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 34 +++++++++++----------
 arch/powerpc/sysdev/dart_iommu.c            | 12 ++++----
 12 files changed, 93 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 45b07f6..eb5822d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3..2abe744 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				    long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free_rm)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	void		(*tce_flush_rm)(struct iommu_table *tbl);
-
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 456acb1..c51ad3e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	build_fail = ppc_md.tce_build(tbl, entry, npages,
+	build_fail = tbl->it_ops->set(tbl, entry, npages,
 				      (unsigned long)page &
 				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
-	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	/* tbl->it_ops->set() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
 	 * DMA_ERROR_CODE. For all other errors the functionality is
 	 * not altered.
@@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	if (!iommu_free_check(tbl, dma_addr, npages))
 		return;
 
-	ppc_md.tce_free(tbl, entry, npages);
+	tbl->it_ops->clear(tbl, entry, npages);
 
 	spin_lock_irqsave(&(pool->lock), flags);
 	bitmap_clear(tbl->it_map, free_entry, npages);
@@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	 * not do an mb() here on purpose, it is not needed on any of
 	 * the current platforms.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -495,7 +495,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		build_fail = ppc_md.tce_build(tbl, entry, npages,
+		build_fail = tbl->it_ops->set(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
 					      direction, attrs);
 		if(unlikely(build_fail))
@@ -534,8 +534,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	DBG("mapped %d elements:\n", outcount);
 
@@ -600,8 +600,8 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 	 * do not do an mb() here, the affected platforms do not need it
 	 * when freeing.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 static void iommu_table_clear(struct iommu_table *tbl)
@@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
 	 */
 	if (!is_kdump_kernel() || is_fadump_active()) {
 		/* Clear the table in case firmware left allocations in it */
-		ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+		tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
 		return;
 	}
 
 #ifdef CONFIG_CRASH_DUMP
-	if (ppc_md.tce_get) {
+	if (tbl->it_ops->get) {
 		unsigned long index, tceval, tcecount = 0;
 
 		/* Reserve the existing mappings left by the first kernel. */
 		for (index = 0; index < tbl->it_size; index++) {
-			tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+			tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
 			/*
 			 * Freed TCE entry contains 0x7fffffffffffffff on JS20
 			 */
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	unsigned int i;
 	struct iommu_pool *p;
 
+	BUG_ON(!tbl->it_ops);
+
 	/* number of bytes needed for the bitmap */
 	sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
* [PATCH v3 06/24] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/
tce_flush callbacks from ppc_md to the new struct, where they really
belong.

This adds the requirement for @it_ops to be initialized before calling
iommu_init_table(), to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. @it_ops is not a parameter of
iommu_init_table() though, as there will be cases when iommu_init_table()
is not called on TCE tables used by VFIO.

This does s/tce_build/set/ and s/tce_free/clear/, removing the
redundant "tce_" prefixes.

This removes the tce_xxx_rm handlers from ppc_md but does not add them
to iommu_table_ops; this will be done later if we decide to support
TCE hypercalls in real mode.

For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_freemulti_pSeriesLP. The multi callbacks now fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce(), and we do not
want to walk through all IOMMU tables in the system replacing the
"multi" callbacks with single-TCE ones.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 17 +++++++++++
 arch/powerpc/include/asm/machdep.h          | 25 ----------------
 arch/powerpc/kernel/iommu.c                 | 46 +++++++++++++++--------------
 arch/powerpc/kernel/vio.c                   |  5 ++++
 arch/powerpc/platforms/cell/iommu.c         |  8 +++--
 arch/powerpc/platforms/pasemi/iommu.c       |  7 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  2 ++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  1 +
 arch/powerpc/platforms/powernv/pci.c        | 23 ++++-----------
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 34 +++++++++++----------
 arch/powerpc/sysdev/dart_iommu.c            | 12 ++++----
 12 files changed, 93 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 45b07f6..eb5822d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3..2abe744 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				    long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free_rm)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	void		(*tce_flush_rm)(struct iommu_table *tbl);
-
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 456acb1..c51ad3e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	build_fail = ppc_md.tce_build(tbl, entry, npages,
+	build_fail = tbl->it_ops->set(tbl, entry, npages,
 				      (unsigned long)page &
 				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
-	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	/* tbl->it_ops->set() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
 	 * DMA_ERROR_CODE. For all other errors the functionality is
 	 * not altered.
@@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	if (!iommu_free_check(tbl, dma_addr, npages))
 		return;
 
-	ppc_md.tce_free(tbl, entry, npages);
+	tbl->it_ops->clear(tbl, entry, npages);
 
 	spin_lock_irqsave(&(pool->lock), flags);
 	bitmap_clear(tbl->it_map, free_entry, npages);
@@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	 * not do an mb() here on purpose, it is not needed on any of
 	 * the current platforms.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -495,7 +495,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		build_fail = ppc_md.tce_build(tbl, entry, npages,
+		build_fail = tbl->it_ops->set(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
 					      direction, attrs);
 		if(unlikely(build_fail))
@@ -534,8 +534,8 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	DBG("mapped %d elements:\n", outcount);
 
@@ -600,8 +600,8 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 	 * do not do an mb() here, the affected platforms do not need it
 	 * when freeing.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 static void iommu_table_clear(struct iommu_table *tbl)
@@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
 	 */
 	if (!is_kdump_kernel() || is_fadump_active()) {
 		/* Clear the table in case firmware left allocations in it */
-		ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+		tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
 		return;
 	}
 
 #ifdef CONFIG_CRASH_DUMP
-	if (ppc_md.tce_get) {
+	if (tbl->it_ops->get) {
 		unsigned long index, tceval, tcecount = 0;
 
 		/* Reserve the existing mappings left by the first kernel. */
 		for (index = 0; index < tbl->it_size; index++) {
-			tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+			tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
 			/*
 			 * Freed TCE entry contains 0x7fffffffffffffff on JS20
 			 */
@@ -657,6 +657,8 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	unsigned int i;
 	struct iommu_pool *p;
 
+	BUG_ON(!tbl->it_ops);
+
 	/* number of bytes needed for the bitmap */
 	sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
@@ -906,8 +908,8 @@ void iommu_register_group(struct iommu_table *tbl,
 void iommu_flush_tce(struct iommu_table *tbl)
 {
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -918,7 +920,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce_value,
 		unsigned long npages)
 {
-	/* ppc_md.tce_free() does not support any value but 0 */
+	/* tbl->it_ops->clear() does not support any value but 0 */
 	if (tce_value)
 		return -EINVAL;
 
@@ -966,9 +968,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		ppc_md.tce_free(tbl, entry, 1);
+		tbl->it_ops->clear(tbl, entry, 1);
 	else
 		oldtce = 0;
 
@@ -991,10 +993,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	/* Add new entry if it is not busy */
 	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
 
 	spin_unlock(&(pool->lock));
 
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 5bfdab9..b41426c 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1196,6 +1196,11 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
 
+	if (firmware_has_feature(FW_FEATURE_LPAR))
+		tbl->it_ops = &iommu_table_lpar_multi_ops;
+	else
+		tbl->it_ops = &iommu_table_pseries_ops;
+
 	return iommu_init_table(tbl, -1);
 }
 
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index c7c8720..72763a8 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -465,6 +465,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
 	return *ioid;
 }
 
+static struct iommu_table_ops cell_iommu_ops = {
+	.set = tce_build_cell,
+	.clear = tce_free_cell
+};
+
 static struct iommu_window * __init
 cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 			unsigned long offset, unsigned long size,
@@ -491,6 +496,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 	window->table.it_offset =
 		(offset >> window->table.it_page_shift) + pte_offset;
 	window->table.it_size = size >> window->table.it_page_shift;
+	window->table.it_ops = &cell_iommu_ops;
 
 	iommu_init_table(&window->table, iommu->nid);
 
@@ -1200,8 +1206,6 @@ static int __init cell_iommu_init(void)
 	/* Setup various ppc_md. callbacks */
 	ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
 	ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
-	ppc_md.tce_build = tce_build_cell;
-	ppc_md.tce_free = tce_free_cell;
 
 	if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
 		goto bail;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 2e576f2..b7245b2 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -132,6 +132,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
 	}
 }
 
+static struct iommu_table_ops iommu_table_iobmap_ops = {
+	.set = iobmap_build,
+	.clear  = iobmap_free
+};
 
 static void iommu_table_iobmap_setup(void)
 {
@@ -151,6 +155,7 @@ static void iommu_table_iobmap_setup(void)
 	 * Should probably be 8 (64 bytes)
 	 */
 	iommu_table_iobmap.it_blocksize = 4;
+	iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
 	iommu_init_table(&iommu_table_iobmap, 0);
 	pr_debug(" <- %s\n", __func__);
 }
@@ -250,8 +255,6 @@ void __init iommu_init_early_pasemi(void)
 
 	ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pasemi;
 	ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pasemi;
-	ppc_md.tce_build = iobmap_build;
-	ppc_md.tce_free  = iobmap_free;
 	set_pci_dma_ops(&dma_iommu_ops);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5c74333..af7a689 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1208,6 +1208,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_FREE   |
 				 TCE_PCI_SWINV_PAIR);
 	}
+	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
@@ -1341,6 +1342,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
+	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 6ef6d4d..0256fcc 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -87,6 +87,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
+		phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
 		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 9ec7d68..c4782b1 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -660,18 +660,11 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
 }
 
-static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
-			    enum dma_data_direction direction,
-			    struct dma_attrs *attrs)
-{
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
-}
-
-static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
-{
-	pnv_tce_free(tbl, index, npages, true);
-}
+struct iommu_table_ops pnv_iommu_ops = {
+	.set = pnv_tce_build_vm,
+	.clear = pnv_tce_free_vm,
+	.get = pnv_tce_get,
+};
 
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
@@ -705,6 +698,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
+	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, hose->node);
 	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
 
@@ -859,11 +853,6 @@ void __init pnv_pci_init(void)
 
 	/* Configure IOMMU DMA hooks */
 	ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
-	ppc_md.tce_build = pnv_tce_build_vm;
-	ppc_md.tce_free = pnv_tce_free_vm;
-	ppc_md.tce_build_rm = pnv_tce_build_rm;
-	ppc_md.tce_free_rm = pnv_tce_free_rm;
-	ppc_md.tce_get = pnv_tce_get;
 	ppc_md.pci_probe_mode = pnv_pci_probe_mode;
 	set_pci_dma_ops(&dma_iommu_ops);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8..f726700 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,6 +216,7 @@ extern struct pci_ops pnv_pci_ops;
 #ifdef CONFIG_EEH
 extern struct pnv_eeh_ops ioda_eeh_ops;
 #endif
+extern struct iommu_table_ops pnv_iommu_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 1d3d52d..1aa1815 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -192,7 +192,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	int ret = 0;
 	unsigned long flags;
 
-	if (npages == 1) {
+	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 		                           direction, attrs);
 	}
@@ -284,6 +284,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
 {
 	u64 rc;
 
+	if (!firmware_has_feature(FW_FEATURE_MULTITCE))
+		return tce_free_pSeriesLP(tbl, tcenum, npages);
+
 	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
 
 	if (rc && printk_ratelimit()) {
@@ -459,7 +462,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
 	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
-
 #ifdef CONFIG_PCI
 static void iommu_table_setparms(struct pci_controller *phb,
 				 struct device_node *dn,
@@ -545,6 +547,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	tbl->it_size = size >> tbl->it_page_shift;
 }
 
+struct iommu_table_ops iommu_table_pseries_ops = {
+	.set = tce_build_pSeries,
+	.clear = tce_free_pSeries,
+	.get = tce_get_pseries
+};
+
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 {
 	struct device_node *dn;
@@ -613,6 +621,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 			   pci->phb->node);
 
 	iommu_table_setparms(pci->phb, dn, tbl);
+	tbl->it_ops = &iommu_table_pseries_ops;
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
 	iommu_register_group(tbl, pci_domain_nr(bus), 0);
 
@@ -624,6 +633,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+	.set = tce_buildmulti_pSeriesLP,
+	.clear = tce_freemulti_pSeriesLP,
+	.get = tce_get_pSeriesLP
+};
 
 static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 {
@@ -658,6 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
+		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
 		iommu_register_group(tbl, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
@@ -685,6 +700,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
+		tbl->it_ops = &iommu_table_pseries_ops;
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
 		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
@@ -1107,6 +1123,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
+		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
 		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
@@ -1299,22 +1316,11 @@ void iommu_init_early_pSeries(void)
 		return;
 
 	if (firmware_has_feature(FW_FEATURE_LPAR)) {
-		if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
-			ppc_md.tce_build = tce_buildmulti_pSeriesLP;
-			ppc_md.tce_free	 = tce_freemulti_pSeriesLP;
-		} else {
-			ppc_md.tce_build = tce_build_pSeriesLP;
-			ppc_md.tce_free	 = tce_free_pSeriesLP;
-		}
-		ppc_md.tce_get   = tce_get_pSeriesLP;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
 		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
 		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
 	} else {
-		ppc_md.tce_build = tce_build_pSeries;
-		ppc_md.tce_free  = tce_free_pSeries;
-		ppc_md.tce_get   = tce_get_pseries;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeries;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeries;
 	}
@@ -1332,8 +1338,6 @@ static int __init disable_multitce(char *str)
 	    firmware_has_feature(FW_FEATURE_LPAR) &&
 	    firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
-		ppc_md.tce_build = tce_build_pSeriesLP;
-		ppc_md.tce_free	 = tce_free_pSeriesLP;
 		powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
 	}
 	return 1;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 9e5353f..ab361a3 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
 	return 0;
 }
 
+static struct iommu_table_ops iommu_dart_ops = {
+	.set = dart_build,
+	.clear = dart_free,
+	.flush = dart_flush,
+};
+
 static void iommu_table_dart_setup(void)
 {
 	iommu_table_dart.it_busno = 0;
@@ -298,6 +304,7 @@ static void iommu_table_dart_setup(void)
 	iommu_table_dart.it_base = (unsigned long)dart_vbase;
 	iommu_table_dart.it_index = 0;
 	iommu_table_dart.it_blocksize = 1;
+	iommu_table_dart.it_ops = &iommu_dart_ops;
 	iommu_init_table(&iommu_table_dart, -1);
 
 	/* Reserve the last page of the DART to avoid possible prefetch
@@ -386,11 +393,6 @@ void __init iommu_init_early_dart(void)
 	if (dart_init(dn) != 0)
 		goto bail;
 
-	/* Setup low level TCE operations for the core IOMMU code */
-	ppc_md.tce_build = dart_build;
-	ppc_md.tce_free  = dart_free;
-	ppc_md.tce_flush = dart_flush;
-
 	/* Setup bypass if supported */
 	if (dart_is_u4)
 		ppc_md.dma_set_mask = dart_dma_set_mask;
-- 
2.0.0


* [PATCH v3 07/24] powerpc/iommu: Introduce iommu_table_alloc() helper
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This replaces multiple calls to kzalloc_node() with a new
iommu_table_alloc() helper. Right now it simply calls kzalloc_node(),
but later it will be modified to allocate a powerpc_iommu struct with
a single iommu_table in it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h       |  1 +
 arch/powerpc/kernel/iommu.c            |  9 +++++++++
 arch/powerpc/platforms/powernv/pci.c   |  2 +-
 arch/powerpc/platforms/pseries/iommu.c | 12 ++++--------
 4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index eb5822d..335e3d4 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -117,6 +117,7 @@ static inline void *get_iommu_table_base(struct device *dev)
 	return dev->archdata.dma_data.iommu_table_base;
 }
 
+extern struct iommu_table *iommu_table_alloc(int node);
 /* Frees table for an individual device node */
 extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c51ad3e..2f7e92b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -710,6 +710,15 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
+struct iommu_table *iommu_table_alloc(int node)
+{
+	struct iommu_table *tbl;
+
+	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
+
+	return tbl;
+}
+
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
 	unsigned long bitmap_sz;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index c4782b1..bbe529b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -693,7 +693,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		       hose->dn->full_name);
 		return NULL;
 	}
-	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, hose->node);
+	tbl = iommu_table_alloc(hose->node);
 	if (WARN_ON(!tbl))
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 1aa1815..bc14299 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -617,8 +617,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	pci->phb->dma_window_size = 0x8000000ul;
 	pci->phb->dma_window_base_cur = 0x8000000ul;
 
-	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
-			   pci->phb->node);
+	tbl = iommu_table_alloc(pci->phb->node);
 
 	iommu_table_setparms(pci->phb, dn, tbl);
 	tbl->it_ops = &iommu_table_pseries_ops;
@@ -669,8 +668,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		 pdn->full_name, ppci->iommu_table);
 
 	if (!ppci->iommu_table) {
-		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
-				   ppci->phb->node);
+		tbl = iommu_table_alloc(ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
@@ -697,8 +695,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		struct pci_controller *phb = PCI_DN(dn)->phb;
 
 		pr_debug(" --> first child, no bridge. Allocating iommu table.\n");
-		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
-				   phb->node);
+		tbl = iommu_table_alloc(phb->node);
 		iommu_table_setparms(phb, dn, tbl);
 		tbl->it_ops = &iommu_table_pseries_ops;
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
@@ -1120,8 +1117,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 
 	pci = PCI_DN(pdn);
 	if (!pci->iommu_table) {
-		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
-				   pci->phb->node);
+		tbl = iommu_table_alloc(pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-- 
2.0.0




* [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
for TCE tables. Right now just one table is supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |  18 ++--
 arch/powerpc/kernel/eeh.c                   |   2 +-
 arch/powerpc/kernel/iommu.c                 |  34 ++++----
 arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/powernv/pci.h        |   4 +-
 arch/powerpc/platforms/pseries/iommu.c      |   9 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
 9 files changed, 170 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 335e3d4..4fe5555 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -90,9 +90,7 @@ struct iommu_table {
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 	unsigned long  it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
-	struct iommu_group *it_group;
-#endif
+	struct powerpc_iommu *it_iommu;
 	struct iommu_table_ops *it_ops;
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
@@ -126,13 +124,23 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+#define POWERPC_IOMMU_MAX_TABLES	1
+
+struct powerpc_iommu {
 #ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+	struct iommu_group *group;
+#endif
+	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
+};
+
+#ifdef CONFIG_IOMMU_API
+extern void iommu_register_group(struct powerpc_iommu *iommu,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct powerpc_iommu *iommu,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index e1b6d8e..319eae3 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1360,7 +1360,7 @@ static int dev_has_iommu_table(struct device *dev, void *data)
 		return 0;
 
 	tbl = get_iommu_table_base(dev);
-	if (tbl && tbl->it_group) {
+	if (tbl && tbl->it_iommu) {
 		*ppdev = pdev;
 		return 1;
 	}
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2f7e92b..952939f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 
 struct iommu_table *iommu_table_alloc(int node)
 {
-	struct iommu_table *tbl;
+	struct powerpc_iommu *iommu;
 
-	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
+	iommu = kzalloc_node(sizeof(struct powerpc_iommu), GFP_KERNEL,
+			   node);
+	iommu->tables[0].it_iommu = iommu;
 
-	return tbl;
+	return &iommu->tables[0];
 }
 
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct powerpc_iommu *iommu = tbl->it_iommu;
 
 	if (!tbl || !tbl->it_map) {
 		printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
@@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 		clear_bit(0, tbl->it_map);
 
 #ifdef CONFIG_IOMMU_API
-	if (tbl->it_group) {
-		iommu_group_put(tbl->it_group);
-		BUG_ON(tbl->it_group);
+	if (iommu->group) {
+		iommu_group_put(iommu->group);
+		BUG_ON(iommu->group);
 	}
 #endif
 
@@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	free_pages((unsigned long) tbl->it_map, order);
 
 	/* free table */
-	kfree(tbl);
+	kfree(iommu);
 }
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
@@ -888,11 +891,12 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	struct powerpc_iommu *iommu = iommu_data;
+
+	iommu->group = NULL;
 }
 
-void iommu_register_group(struct iommu_table *tbl,
+void iommu_register_group(struct powerpc_iommu *iommu,
 		int pci_domain_number, unsigned long pe_num)
 {
 	struct iommu_group *grp;
@@ -904,8 +908,8 @@ void iommu_register_group(struct iommu_table *tbl,
 				PTR_ERR(grp));
 		return;
 	}
-	tbl->it_group = grp;
-	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu->group = grp;
+	iommu_group_set_iommudata(grp, iommu, group_release);
 	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			pci_domain_number, pe_num);
 	if (!name)
@@ -1080,7 +1084,7 @@ int iommu_add_device(struct device *dev)
 	}
 
 	tbl = get_iommu_table_base(dev);
-	if (!tbl || !tbl->it_group) {
+	if (!tbl || !tbl->it_iommu || !tbl->it_iommu->group) {
 		pr_debug("%s: Skipping device %s with no tbl\n",
 			 __func__, dev_name(dev));
 		return 0;
@@ -1088,7 +1092,7 @@ int iommu_add_device(struct device *dev)
 
 	pr_debug("%s: Adding %s to iommu group %d\n",
 		 __func__, dev_name(dev),
-		 iommu_group_id(tbl->it_group));
+		 iommu_group_id(tbl->it_iommu->group));
 
 	if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
 		pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
@@ -1097,7 +1101,7 @@ int iommu_add_device(struct device *dev)
 		return -EINVAL;
 	}
 
-	return iommu_group_add_device(tbl->it_group, dev);
+	return iommu_group_add_device(tbl->it_iommu->group, dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index af7a689..8ab00e3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/msi.h>
 #include <linux/memblock.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -966,7 +967,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->iommu.tables[0]);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -993,7 +994,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->iommu.tables[0]);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1030,9 +1031,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->iommu.tables[0]);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->iommu.tables[0]);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1122,8 +1123,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
+					      iommu);
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1188,8 +1189,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		}
 	}
 
+	/* Setup iommu */
+	pe->iommu.tables[0].it_iommu = &pe->iommu;
+
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->iommu.tables[0];
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1210,7 +1214,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(&pe->iommu, phb->hose->global_number,
+			pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -1228,8 +1233,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
+					      iommu);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1274,10 +1279,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->iommu.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->iommu.tables[0], true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1324,8 +1329,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		goto fail;
 	}
 
+	/* Setup iommu */
+	pe->iommu.tables[0].it_iommu = &pe->iommu;
+
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->iommu.tables[0];
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
@@ -1344,7 +1352,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(&pe->iommu, phb->hose->global_number,
+			pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 0256fcc..e8af682 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -86,14 +86,15 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
-	if (phb->p5ioc2.iommu_table.it_map == NULL) {
-		phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
-		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
-		iommu_register_group(&phb->p5ioc2.iommu_table,
+	if (phb->p5ioc2.iommu.tables[0].it_map == NULL) {
+		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_iommu_ops;
+		iommu_init_table(&phb->p5ioc2.iommu.tables[0], phb->hose->node);
+		iommu_register_group(&phb->p5ioc2.iommu,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
-	set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
+	set_iommu_table_base_and_group(&pdev->dev,
+			&phb->p5ioc2.iommu.tables[0]);
 }
 
 static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
@@ -167,9 +168,12 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
 	/* Setup MSI support */
 	pnv_pci_init_p5ioc2_msis(phb);
 
+	/* Setup iommu */
+	phb->p5ioc2.iommu.tables[0].it_iommu = &phb->p5ioc2.iommu;
+
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
-	pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
+	pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu.tables[0],
 				  tce_mem, tce_size, 0,
 				  IOMMU_PAGE_SHIFT_4K);
 }
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bbe529b..e6f2c43 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -700,7 +700,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, hose->node);
-	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+	iommu_register_group(tbl->it_iommu, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f726700..19f3985 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct powerpc_iommu    iommu;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
@@ -138,7 +138,7 @@ struct pnv_phb {
 
 	union {
 		struct {
-			struct iommu_table iommu_table;
+			struct powerpc_iommu iommu;
 		} p5ioc2;
 
 		struct {
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index bc14299..f537e6e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -622,7 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	iommu_table_setparms(pci->phb, dn, tbl);
 	tbl->it_ops = &iommu_table_pseries_ops;
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(tbl, pci_domain_nr(bus), 0);
+	iommu_register_group(tbl->it_iommu, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -672,7 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(bus), 0);
+		iommu_register_group(tbl->it_iommu, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -699,7 +699,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		iommu_table_setparms(phb, dn, tbl);
 		tbl->it_ops = &iommu_table_pseries_ops;
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
-		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+		iommu_register_group(tbl->it_iommu, pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
 					       PCI_DN(dn)->iommu_table);
 		return;
@@ -1121,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+		iommu_register_group(tbl->it_iommu,
+				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 29d5708..28909e1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_table *tbl;
+	struct iommu_group *grp;
 	bool enabled;
 };
 
@@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
 	return false;
 }
 
+static struct iommu_table *spapr_tce_find_table(
+		struct tce_container *container,
+		phys_addr_t ioba)
+{
+	long i;
+	struct iommu_table *ret = NULL;
+	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
+
+	mutex_lock(&container->lock);
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &iommu->tables[i];
+		unsigned long entry = ioba >> tbl->it_page_shift;
+		unsigned long start = tbl->it_offset;
+		unsigned long end = start + tbl->it_size;
+
+		if ((start <= entry) && (entry < end)) {
+			ret = tbl;
+			break;
+		}
+	}
+	mutex_unlock(&container->lock);
+
+	return ret;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
+	struct powerpc_iommu *iommu;
+	struct iommu_table *tbl;
 
-	if (!container->tbl)
+	if (!container->grp)
 		return -ENXIO;
 
-	if (!current->mm)
-		return -ESRCH; /* process exited */
-
 	if (container->enabled)
 		return -EBUSY;
 
@@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * as this information is only available from KVM and VFIO is
 	 * KVM agnostic.
 	 */
-	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
+	iommu = iommu_group_get_iommudata(container->grp);
+	if (!iommu)
+		return -EFAULT;
+
+	tbl = &iommu->tables[0];
+	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
 	if (ret)
 		return ret;
 
@@ -153,15 +182,23 @@ static int tce_iommu_enable(struct tce_container *container)
 
 static void tce_iommu_disable(struct tce_container *container)
 {
+	struct powerpc_iommu *iommu;
+	struct iommu_table *tbl;
+
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->tbl || !current->mm)
+	if (!container->grp || !current->mm)
 		return;
 
-	decrement_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
+	iommu = iommu_group_get_iommudata(container->grp);
+	if (!iommu)
+		return;
+
+	tbl = &iommu->tables[0];
+	decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -186,11 +223,11 @@ static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
 
-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (container->grp)
+		tce_iommu_detach_group(iommu_data, container->grp);
 
 	mutex_destroy(&container->lock);
 
@@ -297,9 +334,16 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct powerpc_iommu *iommu;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		iommu = iommu_group_get_iommudata(container->grp);
+
+		tbl = &iommu->tables[0];
+		if (WARN_ON_ONCE(!tbl))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -322,14 +366,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
 		unsigned long tce;
 
-		if (!tbl)
+		if (WARN_ON(!container->grp ||
+				!iommu_group_get_iommudata(container->grp)))
 			return -ENXIO;
 
-		BUG_ON(!tbl->it_group);
-
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
 		if (copy_from_user(&param, (void __user *)arg, minsz))
@@ -342,6 +385,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;
 
+		tbl = spapr_tce_find_table(container, param.iova);
+		if (!tbl)
+			return -ENXIO;
+
 		if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
 				(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
 			return -EINVAL;
@@ -367,9 +414,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_UNMAP_DMA: {
 		struct vfio_iommu_type1_dma_unmap param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp ||
+				!iommu_group_get_iommudata(container->grp)))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
@@ -385,6 +433,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;
 
+		tbl = spapr_tce_find_table(container, param.iova);
+		if (!tbl)
+			return -ENXIO;
+
 		if (param.size & ~IOMMU_PAGE_MASK(tbl))
 			return -EINVAL;
 
@@ -413,10 +465,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		mutex_unlock(&container->lock);
 		return 0;
 	case VFIO_EEH_PE_OP:
-		if (!container->tbl || !container->tbl->it_group)
+		if (!container->grp)
 			return -ENODEV;
 
-		return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
+		return vfio_spapr_iommu_eeh_ioctl(container->grp,
 						  cmd, arg);
 	}
 
@@ -428,16 +480,15 @@ static int tce_iommu_attach_group(void *iommu_data,
 {
 	int ret;
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct powerpc_iommu *iommu;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->tbl) {
+	if (container->grp) {
 		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(container->grp),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else if (container->enabled) {
@@ -445,9 +496,13 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else {
-		ret = iommu_take_ownership(tbl);
+		iommu = iommu_group_get_iommudata(iommu_group);
+		if (WARN_ON_ONCE(!iommu))
+			return -ENXIO;
+
+		ret = iommu_take_ownership(&iommu->tables[0]);
 		if (!ret)
-			container->tbl = tbl;
+			container->grp = iommu_group;
 	}
 
 	mutex_unlock(&container->lock);
@@ -459,26 +514,32 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct powerpc_iommu *iommu;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
-	if (tbl != container->tbl) {
+	if (iommu_group != container->grp) {
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
-				iommu_group_id(tbl->it_group));
+				iommu_group_id(container->grp));
 	} else {
 		if (container->enabled) {
 			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
+					iommu_group_id(container->grp));
 			tce_iommu_disable(container);
 		}
 
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
-		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		iommu_release_ownership(tbl);
+		container->grp = NULL;
+
+		iommu = iommu_group_get_iommudata(iommu_group);
+		BUG_ON(!iommu);
+
+		tce_iommu_clear(container, &iommu->tables[0],
+				iommu->tables[0].it_offset,
+				iommu->tables[0].it_size);
+
+		iommu_release_ownership(&iommu->tables[0]);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
for TCE tables; right now only one table per group is supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |  18 ++--
 arch/powerpc/kernel/eeh.c                   |   2 +-
 arch/powerpc/kernel/iommu.c                 |  34 ++++----
 arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/powernv/pci.h        |   4 +-
 arch/powerpc/platforms/pseries/iommu.c      |   9 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
 9 files changed, 170 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 335e3d4..4fe5555 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -90,9 +90,7 @@ struct iommu_table {
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 	unsigned long  it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
-	struct iommu_group *it_group;
-#endif
+	struct powerpc_iommu *it_iommu;
 	struct iommu_table_ops *it_ops;
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
@@ -126,13 +124,23 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+#define POWERPC_IOMMU_MAX_TABLES	1
+
+struct powerpc_iommu {
 #ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+	struct iommu_group *group;
+#endif
+	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
+};
+
+#ifdef CONFIG_IOMMU_API
+extern void iommu_register_group(struct powerpc_iommu *iommu,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct powerpc_iommu *iommu,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index e1b6d8e..319eae3 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1360,7 +1360,7 @@ static int dev_has_iommu_table(struct device *dev, void *data)
 		return 0;
 
 	tbl = get_iommu_table_base(dev);
-	if (tbl && tbl->it_group) {
+	if (tbl && tbl->it_iommu) {
 		*ppdev = pdev;
 		return 1;
 	}
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2f7e92b..952939f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 
 struct iommu_table *iommu_table_alloc(int node)
 {
-	struct iommu_table *tbl;
+	struct powerpc_iommu *iommu;
 
-	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
+	iommu = kzalloc_node(sizeof(struct powerpc_iommu), GFP_KERNEL,
+			   node);
+	iommu->tables[0].it_iommu = iommu;
 
-	return tbl;
+	return &iommu->tables[0];
 }
 
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
+	struct powerpc_iommu *iommu = tbl->it_iommu;
 
 	if (!tbl || !tbl->it_map) {
 		printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
@@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 		clear_bit(0, tbl->it_map);
 
 #ifdef CONFIG_IOMMU_API
-	if (tbl->it_group) {
-		iommu_group_put(tbl->it_group);
-		BUG_ON(tbl->it_group);
+	if (iommu->group) {
+		iommu_group_put(iommu->group);
+		BUG_ON(iommu->group);
 	}
 #endif
 
@@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	free_pages((unsigned long) tbl->it_map, order);
 
 	/* free table */
-	kfree(tbl);
+	kfree(iommu);
 }
 
 /* Creates TCEs for a user provided buffer.  The user buffer must be
@@ -888,11 +891,12 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	struct powerpc_iommu *iommu = iommu_data;
+
+	iommu->group = NULL;
 }
 
-void iommu_register_group(struct iommu_table *tbl,
+void iommu_register_group(struct powerpc_iommu *iommu,
 		int pci_domain_number, unsigned long pe_num)
 {
 	struct iommu_group *grp;
@@ -904,8 +908,8 @@ void iommu_register_group(struct iommu_table *tbl,
 				PTR_ERR(grp));
 		return;
 	}
-	tbl->it_group = grp;
-	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu->group = grp;
+	iommu_group_set_iommudata(grp, iommu, group_release);
 	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			pci_domain_number, pe_num);
 	if (!name)
@@ -1080,7 +1084,7 @@ int iommu_add_device(struct device *dev)
 	}
 
 	tbl = get_iommu_table_base(dev);
-	if (!tbl || !tbl->it_group) {
+	if (!tbl || !tbl->it_iommu || !tbl->it_iommu->group) {
 		pr_debug("%s: Skipping device %s with no tbl\n",
 			 __func__, dev_name(dev));
 		return 0;
@@ -1088,7 +1092,7 @@ int iommu_add_device(struct device *dev)
 
 	pr_debug("%s: Adding %s to iommu group %d\n",
 		 __func__, dev_name(dev),
-		 iommu_group_id(tbl->it_group));
+		 iommu_group_id(tbl->it_iommu->group));
 
 	if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
 		pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
@@ -1097,7 +1101,7 @@ int iommu_add_device(struct device *dev)
 		return -EINVAL;
 	}
 
-	return iommu_group_add_device(tbl->it_group, dev);
+	return iommu_group_add_device(tbl->it_iommu->group, dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index af7a689..8ab00e3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/msi.h>
 #include <linux/memblock.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -966,7 +967,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, &pe->iommu.tables[0]);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -993,7 +994,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->iommu.tables[0]);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1030,9 +1031,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       &pe->iommu.tables[0]);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, &pe->iommu.tables[0]);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1122,8 +1123,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
+					      iommu);
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1188,8 +1189,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		}
 	}
 
+	/* Setup iommu */
+	pe->iommu.tables[0].it_iommu = &pe->iommu;
+
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->iommu.tables[0];
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1210,7 +1214,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(&pe->iommu, phb->hose->global_number,
+			pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -1228,8 +1233,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
+					      iommu);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1274,10 +1279,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->iommu.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->iommu.tables[0], true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1324,8 +1329,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		goto fail;
 	}
 
+	/* Setup iommu */
+	pe->iommu.tables[0].it_iommu = &pe->iommu;
+
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->iommu.tables[0];
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
@@ -1344,7 +1352,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(&pe->iommu, phb->hose->global_number,
+			pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 0256fcc..e8af682 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -86,14 +86,15 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
-	if (phb->p5ioc2.iommu_table.it_map == NULL) {
-		phb->p5ioc2.iommu_table.it_ops = &pnv_iommu_ops;
-		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
-		iommu_register_group(&phb->p5ioc2.iommu_table,
+	if (phb->p5ioc2.iommu.tables[0].it_map == NULL) {
+		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_iommu_ops;
+		iommu_init_table(&phb->p5ioc2.iommu.tables[0], phb->hose->node);
+		iommu_register_group(&phb->p5ioc2.iommu,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
-	set_iommu_table_base_and_group(&pdev->dev, &phb->p5ioc2.iommu_table);
+	set_iommu_table_base_and_group(&pdev->dev,
+			&phb->p5ioc2.iommu.tables[0]);
 }
 
 static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
@@ -167,9 +168,12 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
 	/* Setup MSI support */
 	pnv_pci_init_p5ioc2_msis(phb);
 
+	/* Setup iommu */
+	phb->p5ioc2.iommu.tables[0].it_iommu = &phb->p5ioc2.iommu;
+
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
-	pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
+	pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu.tables[0],
 				  tce_mem, tce_size, 0,
 				  IOMMU_PAGE_SHIFT_4K);
 }
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bbe529b..e6f2c43 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -700,7 +700,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, hose->node);
-	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+	iommu_register_group(tbl->it_iommu, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f726700..19f3985 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct powerpc_iommu    iommu;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
@@ -138,7 +138,7 @@ struct pnv_phb {
 
 	union {
 		struct {
-			struct iommu_table iommu_table;
+			struct powerpc_iommu iommu;
 		} p5ioc2;
 
 		struct {
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index bc14299..f537e6e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -622,7 +622,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	iommu_table_setparms(pci->phb, dn, tbl);
 	tbl->it_ops = &iommu_table_pseries_ops;
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(tbl, pci_domain_nr(bus), 0);
+	iommu_register_group(tbl->it_iommu, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -672,7 +672,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(bus), 0);
+		iommu_register_group(tbl->it_iommu, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -699,7 +699,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		iommu_table_setparms(phb, dn, tbl);
 		tbl->it_ops = &iommu_table_pseries_ops;
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
-		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+		iommu_register_group(tbl->it_iommu, pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
 					       PCI_DN(dn)->iommu_table);
 		return;
@@ -1121,7 +1121,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		tbl->it_ops = &iommu_table_lpar_multi_ops;
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+		iommu_register_group(tbl->it_iommu,
+				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 29d5708..28909e1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_table *tbl;
+	struct iommu_group *grp;
 	bool enabled;
 };
 
@@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
 	return false;
 }
 
+static struct iommu_table *spapr_tce_find_table(
+		struct tce_container *container,
+		phys_addr_t ioba)
+{
+	long i;
+	struct iommu_table *ret = NULL;
+	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
+
+	mutex_lock(&container->lock);
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &iommu->tables[i];
+		unsigned long entry = ioba >> tbl->it_page_shift;
+		unsigned long start = tbl->it_offset;
+		unsigned long end = start + tbl->it_size;
+
+		if ((start <= entry) && (entry < end)) {
+			ret = tbl;
+			break;
+		}
+	}
+	mutex_unlock(&container->lock);
+
+	return ret;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
+	struct powerpc_iommu *iommu;
+	struct iommu_table *tbl;
 
-	if (!container->tbl)
+	if (!container->grp)
 		return -ENXIO;
 
-	if (!current->mm)
-		return -ESRCH; /* process exited */
-
 	if (container->enabled)
 		return -EBUSY;
 
@@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * as this information is only available from KVM and VFIO is
 	 * KVM agnostic.
 	 */
-	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
+	iommu = iommu_group_get_iommudata(container->grp);
+	if (!iommu)
+		return -EFAULT;
+
+	tbl = &iommu->tables[0];
+	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
 	if (ret)
 		return ret;
 
@@ -153,15 +182,23 @@ static int tce_iommu_enable(struct tce_container *container)
 
 static void tce_iommu_disable(struct tce_container *container)
 {
+	struct powerpc_iommu *iommu;
+	struct iommu_table *tbl;
+
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->tbl || !current->mm)
+	if (!container->grp || !current->mm)
 		return;
 
-	decrement_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
+	iommu = iommu_group_get_iommudata(container->grp);
+	if (!iommu)
+		return;
+
+	tbl = &iommu->tables[0];
+	decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -186,11 +223,11 @@ static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
 
-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (container->grp)
+		tce_iommu_detach_group(iommu_data, container->grp);
 
 	mutex_destroy(&container->lock);
 
@@ -297,9 +334,16 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct powerpc_iommu *iommu;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		iommu = iommu_group_get_iommudata(container->grp);
+
+		tbl = &iommu->tables[0];
+		if (WARN_ON_ONCE(!tbl))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -322,14 +366,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
 		unsigned long tce;
 
-		if (!tbl)
+		if (WARN_ON(!container->grp ||
+				!iommu_group_get_iommudata(container->grp)))
 			return -ENXIO;
 
-		BUG_ON(!tbl->it_group);
-
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
 		if (copy_from_user(&param, (void __user *)arg, minsz))
@@ -342,6 +385,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 				VFIO_DMA_MAP_FLAG_WRITE))
 			return -EINVAL;
 
+		tbl = spapr_tce_find_table(container, param.iova);
+		if (!tbl)
+			return -ENXIO;
+
 		if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
 				(param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
 			return -EINVAL;
@@ -367,9 +414,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_UNMAP_DMA: {
 		struct vfio_iommu_type1_dma_unmap param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp ||
+				!iommu_group_get_iommudata(container->grp)))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
@@ -385,6 +433,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags)
 			return -EINVAL;
 
+		tbl = spapr_tce_find_table(container, param.iova);
+		if (!tbl)
+			return -ENXIO;
+
 		if (param.size & ~IOMMU_PAGE_MASK(tbl))
 			return -EINVAL;
 
@@ -413,10 +465,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		mutex_unlock(&container->lock);
 		return 0;
 	case VFIO_EEH_PE_OP:
-		if (!container->tbl || !container->tbl->it_group)
+		if (!container->grp)
 			return -ENODEV;
 
-		return vfio_spapr_iommu_eeh_ioctl(container->tbl->it_group,
+		return vfio_spapr_iommu_eeh_ioctl(container->grp,
 						  cmd, arg);
 	}
 
@@ -428,16 +480,15 @@ static int tce_iommu_attach_group(void *iommu_data,
 {
 	int ret;
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct powerpc_iommu *iommu;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->tbl) {
+	if (container->grp) {
 		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(container->grp),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else if (container->enabled) {
@@ -445,9 +496,13 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else {
-		ret = iommu_take_ownership(tbl);
+		iommu = iommu_group_get_iommudata(iommu_group);
+		if (WARN_ON_ONCE(!iommu))
+			return -ENXIO;
+
+		ret = iommu_take_ownership(&iommu->tables[0]);
 		if (!ret)
-			container->tbl = tbl;
+			container->grp = iommu_group;
 	}
 
 	mutex_unlock(&container->lock);
@@ -459,26 +514,32 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct powerpc_iommu *iommu;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
-	if (tbl != container->tbl) {
+	if (iommu_group != container->grp) {
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
-				iommu_group_id(tbl->it_group));
+				iommu_group_id(container->grp));
 	} else {
 		if (container->enabled) {
 			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
+					iommu_group_id(container->grp));
 			tce_iommu_disable(container);
 		}
 
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
-		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		iommu_release_ownership(tbl);
+		container->grp = NULL;
+
+		iommu = iommu_group_get_iommudata(iommu_group);
+		BUG_ON(!iommu);
+
+		tce_iommu_clear(container, &iommu->tables[0],
+				iommu->tables[0].it_offset,
+				iommu->tables[0].it_size);
+
+		iommu_release_ownership(&iommu->tables[0]);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 09/24] powerpc/iommu: Fix IOMMU ownership control functions
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership of it
is taken.

This only clears TCE content if there is no page marked busy in it_map.
Clearing must be done outside of the table locks because iommu_clear_tce(),
called from iommu_clear_tces_and_put_pages(), takes those locks itself.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Note: we might want to get rid of it as this patchset removes it_map
from tables passed to VFIO.

Changes:
v5:
* do not store bit #0 value, it has to be set for a zero-based table anyway
* removed test_and_clear_bit
* only disable bypass if taking ownership succeeded
---
 arch/powerpc/kernel/iommu.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 952939f..407d0d6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1024,33 +1024,48 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
 		clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (tbl->it_offset == 0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
 	/*
 	 * Disable iommu bypass, otherwise the user can DMA to all of
 	 * our physical memory via the bypass window instead of just
 	 * the pages that has been explicitly mapped into the iommu
 	 */
-	if (tbl->set_bypass)
+	if (!ret && tbl->set_bypass)
 		tbl->set_bypass(tbl, false);
 
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	memset(tbl->it_map, 0, sz);
 
@@ -1058,6 +1073,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
 
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
 	/* The kernel owns the device now, we can restore the iommu bypass */
 	if (tbl->set_bypass)
 		tbl->set_bypass(tbl, true);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* [PATCH v3 10/24] powerpc/powernv/ioda2: Rework IOMMU ownership control
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on an IODA2 PHB. This is exposed to the
powerpc IOMMU code which calls this callback when an external IOMMU
user such as VFIO is about to take control over a PHB.

The set_bypass() callback is not really an iommu_table function but an
IOMMU/PE function. This introduces a powerpc_iommu_ops struct and
adds a set_ownership() callback to it which is called when an external
user takes control over the IOMMU.

This renames set_bypass() to set_ownership() as it does not necessarily
just enable bypassing; it can be something else/more, so the callback
gets a more generic name. The bool parameter is inverted.

The callback is implemented for IODA2 only.

This replaces iommu_take_ownership()/iommu_release_ownership() calls
with the callback calls and it is up to the platform code to call
iommu_take_ownership()/iommu_release_ownership() if needed. Next patches
will remove these calls from IODA2 code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          | 18 +++++++++--
 arch/powerpc/kernel/iommu.c               | 53 +++++++++++++++++++++++--------
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++++++++++++-----
 drivers/vfio/vfio_iommu_spapr_tce.c       | 19 ++++++++---
 4 files changed, 90 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4fe5555..ba16aa0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -92,7 +92,6 @@ struct iommu_table {
 	unsigned long  it_page_shift;/* table iommu page size */
 	struct powerpc_iommu *it_iommu;
 	struct iommu_table_ops *it_ops;
-	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
@@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 
 #define POWERPC_IOMMU_MAX_TABLES	1
 
+struct powerpc_iommu;
+
+struct powerpc_iommu_ops {
+	/*
+	 * Switches ownership from the kernel itself to an external
+	 * user. While ownership is enabled, the kernel cannot use the IOMMU
+	 * for itself.
+	 */
+	void (*set_ownership)(struct powerpc_iommu *iommu,
+			bool enable);
+};
+
 struct powerpc_iommu {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *group;
 #endif
 	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
+	struct powerpc_iommu_ops *ops;
 };
 
 #ifdef CONFIG_IOMMU_API
@@ -219,8 +231,8 @@ extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
 		unsigned long entry);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
-extern int iommu_take_ownership(struct iommu_table *tbl);
-extern void iommu_release_ownership(struct iommu_table *tbl);
+extern int iommu_take_ownership(struct powerpc_iommu *iommu);
+extern void iommu_release_ownership(struct powerpc_iommu *iommu);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 407d0d6..9d06425 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1022,7 +1022,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_build);
 
-int iommu_take_ownership(struct iommu_table *tbl)
+static int iommu_table_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0;
@@ -1047,19 +1047,36 @@ int iommu_take_ownership(struct iommu_table *tbl)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-	/*
-	 * Disable iommu bypass, otherwise the user can DMA to all of
-	 * our physical memory via the bypass window instead of just
-	 * the pages that has been explicitly mapped into the iommu
-	 */
-	if (!ret && tbl->set_bypass)
-		tbl->set_bypass(tbl, false);
-
-	return ret;
+	return ret;
+}
+
+static void iommu_table_release_ownership(struct iommu_table *tbl);
+
+int iommu_take_ownership(struct powerpc_iommu *iommu)
+{
+	int i, j, rc = 0;
+
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &iommu->tables[i];
+
+		if (!tbl->it_map)
+			continue;
+
+		rc = iommu_table_take_ownership(tbl);
+		if (rc) {
+			for (j = 0; j < i; ++j)
+				iommu_table_release_ownership(
+						&iommu->tables[j]);
+
+			return rc;
+		}
+	}
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
-void iommu_release_ownership(struct iommu_table *tbl)
+static void iommu_table_release_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
@@ -1076,10 +1093,18 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+}
 
-	/* The kernel owns the device now, we can restore the iommu bypass */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, true);
+void iommu_release_ownership(struct powerpc_iommu *iommu)
+{
+	int i;
+
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &iommu->tables[i];
+
+		if (tbl->it_map)
+			iommu_table_release_ownership(tbl);
+	}
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8ab00e3..a33a116 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1231,10 +1231,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
-					      iommu);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1262,7 +1260,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 		 * host side.
 		 */
 		if (pe->pdev)
-			set_iommu_table_base(&pe->pdev->dev, tbl);
+			set_iommu_table_base(&pe->pdev->dev,
+					&pe->iommu.tables[0]);
 		else
 			pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
 	}
@@ -1278,13 +1277,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	/* Install set_bypass callback for VFIO */
-	pe->iommu.tables[0].set_bypass = pnv_pci_ioda2_set_bypass;
-
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->iommu.tables[0], true);
+	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static void pnv_ioda2_set_ownership(struct powerpc_iommu *iommu,
+				     bool enable)
+{
+	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
+						iommu);
+	if (enable)
+		iommu_take_ownership(iommu);
+	else
+		iommu_release_ownership(iommu);
+
+	pnv_pci_ioda2_set_bypass(pe, !enable);
+}
+
+static struct powerpc_iommu_ops pnv_pci_ioda2_ops = {
+	.set_ownership = pnv_ioda2_set_ownership,
+};
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -1352,6 +1365,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 	tbl->it_ops = &pnv_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
+	pe->iommu.ops = &pnv_pci_ioda2_ops;
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
 			pe->pe_number);
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 28909e1..bcde2ef 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -478,7 +478,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 static int tce_iommu_attach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
-	int ret;
+	int ret = 0;
 	struct tce_container *container = iommu_data;
 	struct powerpc_iommu *iommu;
 
@@ -499,10 +499,17 @@ static int tce_iommu_attach_group(void *iommu_data,
 		iommu = iommu_group_get_iommudata(iommu_group);
 		if (WARN_ON_ONCE(!iommu))
 			return -ENXIO;
-
-		ret = iommu_take_ownership(&iommu->tables[0]);
-		if (!ret)
+		/*
+		 * Disable iommu bypass, otherwise the user can DMA to all of
+		 * our physical memory via the bypass window instead of just
+		 * the pages that have been explicitly mapped into the iommu
+		 */
+		if (iommu->ops && iommu->ops->set_ownership) {
+			iommu->ops->set_ownership(iommu, true);
 			container->grp = iommu_group;
+		} else {
+			return -ENODEV;
+		}
 	}
 
 	mutex_unlock(&container->lock);
@@ -539,7 +546,9 @@ static void tce_iommu_detach_group(void *iommu_data,
 				iommu->tables[0].it_offset,
 				iommu->tables[0].it_size);
 
-		iommu_release_ownership(&iommu->tables[0]);
+		/* Kernel owns the device now, we can restore bypass */
+		if (iommu->ops && iommu->ops->set_ownership)
+			iommu->ops->set_ownership(iommu, false);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* [PATCH v3 11/24] powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
supposed to be called on IODA1/2 and not on p5ioc2. It receives start and
end host addresses of the TCE table, an approach which makes it possible
for pnv_pci_ioda_tce_invalidate() to be called unintentionally on p5ioc2.
Another issue is that IODA2 needs PCI addresses to invalidate the cache.
Those can be calculated from host addresses today, but since we are going
to implement multi-level TCE tables, calculating a PCI address from
a host address would get either tricky or ugly as the TCE table remains
flat on the PCI bus but not in RAM.

This defines separate iommu_table_ops callbacks for the p5ioc2 and IODA1/2
PHBs. They all call the common pnv_tce_build/pnv_tce_free/pnv_tce_get
helpers but call a PHB-specific TCE invalidation helper (when needed).

This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
a number of pages, which are PCI addresses shifted by the IOMMU page shift.

The patch is pretty mechanical and behaviour is not expected to change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c   | 92 ++++++++++++++++++++++-------
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  8 ++-
 arch/powerpc/platforms/powernv/pci.c        | 76 +++++++++---------------
 arch/powerpc/platforms/powernv/pci.h        |  7 ++-
 4 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a33a116..dfc56fc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1041,18 +1041,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
-					 struct iommu_table *tbl,
-					 __be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
+		unsigned long index, unsigned long npages, bool rm)
 {
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu,
+			struct pnv_ioda_pe, iommu);
 	__be64 __iomem *invalidate = rm ?
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
 		(__be64 __iomem *)tbl->it_index;
 	unsigned long start, end, inc;
 	const unsigned shift = tbl->it_page_shift;
 
-	start = __pa(startp);
-	end = __pa(endp);
+	start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
+	end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
+			npages - 1);
 
 	/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
 	if (tbl->it_busno) {
@@ -1088,10 +1090,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
 	 */
 }
 
-static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
-					 struct iommu_table *tbl,
-					 __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
+	long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+		pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
+}
+
+static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
+		long npages)
+{
+	pnv_tce_free(tbl, index, npages);
+
+	if (tbl->it_type & TCE_PCI_SWINV_FREE)
+		pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+}
+
+struct iommu_table_ops pnv_ioda1_iommu_ops = {
+	.set = pnv_ioda1_tce_build_vm,
+	.clear = pnv_ioda1_tce_free_vm,
+	.get = pnv_tce_get,
+};
+
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+		unsigned long index, unsigned long npages, bool rm)
+{
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu,
+			struct pnv_ioda_pe, iommu);
 	unsigned long start, end, inc;
 	__be64 __iomem *invalidate = rm ?
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
@@ -1104,9 +1136,9 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	end = start;
 
 	/* Figure out the start, end and step */
-	inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
+	inc = tbl->it_offset + index;
 	start |= (inc << shift);
-	inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
+	inc = tbl->it_offset + index + npages - 1;
 	end |= (inc << shift);
 	inc = (0x1ull << shift);
 	mb();
@@ -1120,19 +1152,35 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
-					      iommu);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
+	long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
 }
 
+static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
+		long npages)
+{
+	pnv_tce_free(tbl, index, npages);
+
+	if (tbl->it_type & TCE_PCI_SWINV_FREE)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+}
+
+static struct iommu_table_ops pnv_ioda2_iommu_ops = {
+	.set = pnv_ioda2_tce_build_vm,
+	.clear = pnv_ioda2_tce_free_vm,
+	.get = pnv_tce_get,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -1212,7 +1260,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_FREE   |
 				 TCE_PCI_SWINV_PAIR);
 	}
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
 			pe->pe_number);
@@ -1363,7 +1411,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda2_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index e8af682..27ddaca 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -83,11 +83,17 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb)
 static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
+	.set = pnv_tce_build,
+	.clear = pnv_tce_free,
+	.get = pnv_tce_get,
+};
+
 static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
 	if (phb->p5ioc2.iommu.tables[0].it_map == NULL) {
-		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_iommu_ops;
+		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_p5ioc2_iommu_ops;
 		iommu_init_table(&phb->p5ioc2.iommu.tables[0], phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e6f2c43..3ab69e2 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -602,70 +602,48 @@ static unsigned long pnv_dmadir_to_flags(enum dma_data_direction direction)
 	}
 }
 
-static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
-			 unsigned long uaddr, enum dma_data_direction direction,
-			 struct dma_attrs *attrs, bool rm)
+static __be64 *pnv_tce(struct iommu_table *tbl, long index)
+{
+	__be64 *tmp = ((__be64 *)tbl->it_base);
+
+	return tmp + index;
+}
+
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+		unsigned long uaddr, enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
 	u64 proto_tce = pnv_dmadir_to_flags(direction);
-	__be64 *tcep, *tces;
-	u64 rpn;
+	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+	long i;
 
-	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
-	rpn = __pa(uaddr) >> tbl->it_page_shift;
+	for (i = 0; i < npages; i++) {
+		unsigned long newtce = proto_tce |
+				((rpn + i) << tbl->it_page_shift);
+		unsigned long idx = index - tbl->it_offset + i;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce |
-				(rpn++ << tbl->it_page_shift));
-
-	/* Some implementations won't cache invalid TCEs and thus may not
-	 * need that flush. We'll probably turn it_type into a bit mask
-	 * of flags if that becomes the case
-	 */
-	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+	}
 
 	return 0;
 }
 
-static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
-			    enum dma_data_direction direction,
-			    struct dma_attrs *attrs)
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
-			false);
-}
-
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
-		bool rm)
-{
-	__be64 *tcep, *tces;
-
-	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+	long i;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(0);
+	for (i = 0; i < npages; i++) {
+		unsigned long idx = index - tbl->it_offset + i;
 
-	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+	}
 }
 
-static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 {
-	pnv_tce_free(tbl, index, npages, false);
+	return *(pnv_tce(tbl, index));
 }
 
-static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
-{
-	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
-}
-
-struct iommu_table_ops pnv_iommu_ops = {
-	.set = pnv_tce_build_vm,
-	.clear = pnv_tce_free_vm,
-	.get = pnv_tce_get,
-};
-
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
 			       u64 dma_offset, unsigned page_shift)
@@ -698,7 +676,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	iommu_init_table(tbl, hose->node);
 	iommu_register_group(tbl->it_iommu, pci_domain_nr(hose->bus), 0);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 19f3985..724bce9 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,7 +216,12 @@ extern struct pci_ops pnv_pci_ops;
 #ifdef CONFIG_EEH
 extern struct pnv_eeh_ops ioda_eeh_ops;
 #endif
-extern struct iommu_table_ops pnv_iommu_ops;
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+		unsigned long uaddr, enum dma_data_direction direction,
+		struct dma_attrs *attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+extern struct iommu_table_ops pnv_ioda1_iommu_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-- 
2.0.0



* [PATCH v3 11/24] powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free()
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
supposed to be called on IODA1/2 but not on p5ioc2, and it receives
the start and end host addresses of the TCE table. This approach makes it
possible for pnv_pci_ioda_tce_invalidate() to be called on p5ioc2
unintentionally. Another issue is that IODA2 needs PCI addresses to
invalidate the cache. Those can be calculated from host addresses today,
but once multi-level TCE tables are implemented, calculating a PCI address
from a host address gets either tricky or ugly, as the TCE table remains
flat on the PCI bus but not in RAM.

This defines separate iommu_table_ops callbacks for the p5ioc2 and IODA1/2
PHBs. They all call the common pnv_tce_build/pnv_tce_free/pnv_tce_get
helpers but invoke a PHB-specific TCE invalidation helper (when needed).

This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
a number of pages, which are PCI addresses shifted by the IOMMU page shift.

The patch is mostly mechanical; no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c   | 92 ++++++++++++++++++++++-------
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  8 ++-
 arch/powerpc/platforms/powernv/pci.c        | 76 +++++++++---------------
 arch/powerpc/platforms/powernv/pci.h        |  7 ++-
 4 files changed, 110 insertions(+), 73 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a33a116..dfc56fc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1041,18 +1041,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
-					 struct iommu_table *tbl,
-					 __be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
+		unsigned long index, unsigned long npages, bool rm)
 {
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu,
+			struct pnv_ioda_pe, iommu);
 	__be64 __iomem *invalidate = rm ?
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
 		(__be64 __iomem *)tbl->it_index;
 	unsigned long start, end, inc;
 	const unsigned shift = tbl->it_page_shift;
 
-	start = __pa(startp);
-	end = __pa(endp);
+	start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset);
+	end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset +
+			npages - 1);
 
 	/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
 	if (tbl->it_busno) {
@@ -1088,10 +1090,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
 	 */
 }
 
-static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
-					 struct iommu_table *tbl,
-					 __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda1_tce_build_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
+	long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+		pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
+}
+
+static void pnv_ioda1_tce_free_vm(struct iommu_table *tbl, long index,
+		long npages)
+{
+	pnv_tce_free(tbl, index, npages);
+
+	if (tbl->it_type & TCE_PCI_SWINV_FREE)
+		pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+}
+
+struct iommu_table_ops pnv_ioda1_iommu_ops = {
+	.set = pnv_ioda1_tce_build_vm,
+	.clear = pnv_ioda1_tce_free_vm,
+	.get = pnv_tce_get,
+};
+
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+		unsigned long index, unsigned long npages, bool rm)
+{
+	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu,
+			struct pnv_ioda_pe, iommu);
 	unsigned long start, end, inc;
 	__be64 __iomem *invalidate = rm ?
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
@@ -1104,9 +1136,9 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	end = start;
 
 	/* Figure out the start, end and step */
-	inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
+	inc = tbl->it_offset + index / sizeof(u64);
 	start |= (inc << shift);
-	inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
+	inc = tbl->it_offset + (index + npages - 1) / sizeof(u64);
 	end |= (inc << shift);
 	inc = (0x1ull << shift);
 	mb();
@@ -1120,19 +1152,35 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
+static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl->it_iommu, struct pnv_ioda_pe,
-					      iommu);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
+	long ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
 }
 
+static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
+		long npages)
+{
+	pnv_tce_free(tbl, index, npages);
+
+	if (tbl->it_type & TCE_PCI_SWINV_FREE)
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+}
+
+static struct iommu_table_ops pnv_ioda2_iommu_ops = {
+	.set = pnv_ioda2_tce_build_vm,
+	.clear = pnv_ioda2_tce_free_vm,
+	.get = pnv_tce_get,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -1212,7 +1260,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_FREE   |
 				 TCE_PCI_SWINV_PAIR);
 	}
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
 			pe->pe_number);
@@ -1363,7 +1411,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda2_iommu_ops;
 	iommu_init_table(tbl, phb->hose->node);
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index e8af682..27ddaca 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -83,11 +83,17 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb)
 static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+static struct iommu_table_ops pnv_p5ioc2_iommu_ops = {
+	.set = pnv_tce_build,
+	.clear = pnv_tce_free,
+	.get = pnv_tce_get,
+};
+
 static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
 	if (phb->p5ioc2.iommu.tables[0].it_map == NULL) {
-		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_iommu_ops;
+		phb->p5ioc2.iommu.tables[0].it_ops = &pnv_p5ioc2_iommu_ops;
 		iommu_init_table(&phb->p5ioc2.iommu.tables[0], phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index e6f2c43..3ab69e2 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -602,70 +602,48 @@ static unsigned long pnv_dmadir_to_flags(enum dma_data_direction direction)
 	}
 }
 
-static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
-			 unsigned long uaddr, enum dma_data_direction direction,
-			 struct dma_attrs *attrs, bool rm)
+static __be64 *pnv_tce(struct iommu_table *tbl, long index)
+{
+	__be64 *tmp = ((__be64 *)tbl->it_base);
+
+	return tmp + index;
+}
+
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+		unsigned long uaddr, enum dma_data_direction direction,
+		struct dma_attrs *attrs)
 {
 	u64 proto_tce = pnv_dmadir_to_flags(direction);
-	__be64 *tcep, *tces;
-	u64 rpn;
+	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+	long i;
 
-	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
-	rpn = __pa(uaddr) >> tbl->it_page_shift;
+	for (i = 0; i < npages; i++) {
+		unsigned long newtce = proto_tce |
+				((rpn + i) << tbl->it_page_shift);
+		unsigned long idx = index - tbl->it_offset + i;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce |
-				(rpn++ << tbl->it_page_shift));
-
-	/* Some implementations won't cache invalid TCEs and thus may not
-	 * need that flush. We'll probably turn it_type into a bit mask
-	 * of flags if that becomes the case
-	 */
-	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+	}
 
 	return 0;
 }
 
-static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
-			    enum dma_data_direction direction,
-			    struct dma_attrs *attrs)
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
-			false);
-}
-
-static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
-		bool rm)
-{
-	__be64 *tcep, *tces;
-
-	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+	long i;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(0);
+	for (i = 0; i < npages; i++) {
+		unsigned long idx = index - tbl->it_offset + i;
 
-	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+	}
 }
 
-static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 {
-	pnv_tce_free(tbl, index, npages, false);
+	return *(pnv_tce(tbl, index));
 }
 
-static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
-{
-	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
-}
-
-struct iommu_table_ops pnv_iommu_ops = {
-	.set = pnv_tce_build_vm,
-	.clear = pnv_tce_free_vm,
-	.get = pnv_tce_get,
-};
-
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
 			       u64 dma_offset, unsigned page_shift)
@@ -698,7 +676,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
-	tbl->it_ops = &pnv_iommu_ops;
+	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	iommu_init_table(tbl, hose->node);
 	iommu_register_group(tbl->it_iommu, pci_domain_nr(hose->bus), 0);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 19f3985..724bce9 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,7 +216,12 @@ extern struct pci_ops pnv_pci_ops;
 #ifdef CONFIG_EEH
 extern struct pnv_eeh_ops ioda_eeh_ops;
 #endif
-extern struct iommu_table_ops pnv_iommu_ops;
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+		unsigned long uaddr, enum dma_data_direction direction,
+		struct dma_attrs *attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+extern struct iommu_table_ops pnv_ioda1_iommu_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-- 
2.0.0


* [PATCH v3 12/24] powerpc/iommu/powernv: Release replaced TCE
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

At the moment, writing a new TCE value to the IOMMU table fails with EBUSY
if a valid entry is already present. However, the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch addresses is the use of pool locks by
external IOMMU users such as VFIO. The pool locks protect the
DMA page allocator rather than the entries, and since the host kernel does
not control which pages are in use, there is no point in taking them;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() but also returns the replaced TCE(s) so the caller can
release the pages afterwards.

exchange() is implemented for IODA2 only. Since platforms are now required
to implement exchange(), IODA2 becomes the only PHB supported by
VFIO-SPAPR.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          | 13 +++++---
 arch/powerpc/kernel/iommu.c               | 50 +++++++++++--------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++
 arch/powerpc/platforms/powernv/pci.c      | 22 ++++++++++++++
 arch/powerpc/platforms/powernv/pci.h      |  4 +++
 drivers/vfio/vfio_iommu_spapr_tce.c       | 36 ++++++++++++++--------
 6 files changed, 92 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ba16aa0..bf26d47 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
 			unsigned long uaddr,
 			enum dma_data_direction direction,
 			struct dma_attrs *attrs);
+	int (*exchange)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			unsigned long *old_tces,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
 	unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -225,10 +231,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
 		unsigned long npages);
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-		unsigned long entry);
+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, unsigned long *oldtce,
+		enum dma_data_direction direction);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct powerpc_iommu *iommu);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9d06425..26feaff 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -974,44 +974,18 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		tbl->it_ops->clear(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-	return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 /*
  * hwaddr is a kernel virtual address here (0xc... bazillion),
  * tce_build converts it to a physical address.
  */
-int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction)
+long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, unsigned long *oldtce,
+		enum dma_data_direction direction)
 {
-	int ret = -EBUSY;
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
+	long ret;
 
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
-
-	spin_unlock(&(pool->lock));
+	ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, oldtce,
+			direction, NULL);
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1020,13 +994,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_build);
+EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
 static int iommu_table_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0;
 
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires exchange() callback defined so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->exchange)
+		return -EINVAL;
+
 	spin_lock_irqsave(&tbl->large_pool.lock, flags);
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index dfc56fc..6d279d5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1166,6 +1166,21 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
 	return ret;
 }
 
+static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
+{
+	long ret = pnv_tce_xchg(tbl, index, npages, uaddr, old_tces, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type &
+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
+}
+
 static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
 		long npages)
 {
@@ -1177,6 +1192,7 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build_vm,
+	.exchange = pnv_ioda2_tce_xchg_vm,
 	.clear = pnv_ioda2_tce_free_vm,
 	.get = pnv_tce_get,
 };
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 3ab69e2..cf8206b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,6 +628,28 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	return 0;
 }
 
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
+{
+	u64 proto_tce = pnv_dmadir_to_flags(direction);
+	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+	long i;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long newtce = proto_tce |
+				((rpn + i) << tbl->it_page_shift);
+		unsigned long idx = index - tbl->it_offset + i;
+		unsigned long oldtce = xchg(pnv_tce(tbl, idx),
+				cpu_to_be64(newtce));
+
+		old_tces[i] = (unsigned long) __va(be64_to_cpu(oldtce));
+	}
+
+	return 0;
+}
+
 void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
 	long i;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 724bce9..6491581 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -220,6 +220,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
 		struct dma_attrs *attrs);
 extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs);
 extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
 extern struct iommu_table_ops pnv_ioda1_iommu_ops;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index bcde2ef..8256275 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -234,25 +234,34 @@ static void tce_iommu_release(void *iommu_data)
 	kfree(container);
 }
 
+static void tce_iommu_unuse_page(unsigned long oldtce)
+{
+	struct page *page;
+
+	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+		return;
+
+	page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
+	if (oldtce & TCE_PCI_WRITE)
+		SetPageDirty(page);
+
+	put_page(page);
+}
+
 static int tce_iommu_clear(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages)
 {
+	long ret;
 	unsigned long oldtce;
-	struct page *page;
 
 	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
+		oldtce = 0;
+		ret = iommu_tce_xchg(tbl, entry, 0, &oldtce, DMA_NONE);
+		if (ret)
 			continue;
 
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
+		tce_iommu_unuse_page(oldtce);
 	}
 
 	return 0;
@@ -276,7 +285,7 @@ static long tce_iommu_build(struct tce_container *container,
 {
 	long i, ret = 0;
 	struct page *page = NULL;
-	unsigned long hva;
+	unsigned long hva, oldtce;
 	enum dma_data_direction direction = tce_iommu_direction(tce);
 
 	for (i = 0; i < pages; ++i) {
@@ -294,8 +303,9 @@ static long tce_iommu_build(struct tce_container *container,
 
 		hva = (unsigned long) page_address(page) +
 			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
+		oldtce = 0;
 
-		ret = iommu_tce_build(tbl, entry + 1, hva, direction);
+		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
 		if (ret) {
 			put_page(page);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
@@ -303,6 +313,8 @@ static long tce_iommu_build(struct tce_container *container,
 					tce, ret);
 			break;
 		}
+
+		tce_iommu_unuse_page(oldtce);
 		tce += IOMMU_PAGE_SIZE(tbl);
 	}
 
-- 
2.0.0



* [PATCH v3 12/24] powerpc/iommu/powernv: Release replaced TCE
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

At the moment, writing a new TCE value to the IOMMU table fails with EBUSY
if a valid entry is already present. However, the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch addresses is the use of pool locks by
external IOMMU users such as VFIO. The pool locks protect the
DMA page allocator rather than the entries, and since the host kernel does
not control which pages are in use, there is no point in taking them;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() but also returns the replaced TCE(s) so the caller can
release the pages afterwards.

exchange() is implemented for IODA2 only. Since platforms are now required
to implement exchange(), IODA2 becomes the only PHB supported by
VFIO-SPAPR.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          | 13 +++++---
 arch/powerpc/kernel/iommu.c               | 50 +++++++++++--------------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++
 arch/powerpc/platforms/powernv/pci.c      | 22 ++++++++++++++
 arch/powerpc/platforms/powernv/pci.h      |  4 +++
 drivers/vfio/vfio_iommu_spapr_tce.c       | 36 ++++++++++++++--------
 6 files changed, 92 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ba16aa0..bf26d47 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
 			unsigned long uaddr,
 			enum dma_data_direction direction,
 			struct dma_attrs *attrs);
+	int (*exchange)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			unsigned long *old_tces,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
 	unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -225,10 +231,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
 		unsigned long npages);
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-		unsigned long entry);
+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, unsigned long *oldtce,
+		enum dma_data_direction direction);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct powerpc_iommu *iommu);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9d06425..26feaff 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -974,44 +974,18 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		tbl->it_ops->clear(tbl, entry, 1);
-	else
-		oldtce = 0;
-
-	spin_unlock(&(pool->lock));
-
-	return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 /*
  * hwaddr is a kernel virtual address here (0xc... bazillion),
  * tce_build converts it to a physical address.
  */
-int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-		unsigned long hwaddr, enum dma_data_direction direction)
+long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, unsigned long *oldtce,
+		enum dma_data_direction direction)
 {
-	int ret = -EBUSY;
-	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
+	long ret;
 
-	spin_lock(&(pool->lock));
-
-	oldtce = tbl->it_ops->get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
-
-	spin_unlock(&(pool->lock));
+	ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, oldtce,
+			direction, NULL);
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1020,13 +994,23 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_build);
+EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
 static int iommu_table_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0;
 
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires exchange() callback defined so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->exchange)
+		return -EINVAL;
+
 	spin_lock_irqsave(&tbl->large_pool.lock, flags);
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index dfc56fc..6d279d5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1166,6 +1166,21 @@ static int pnv_ioda2_tce_build_vm(struct iommu_table *tbl, long index,
 	return ret;
 }
 
+static int pnv_ioda2_tce_xchg_vm(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
+{
+	long ret = pnv_tce_xchg(tbl, index, npages, uaddr, old_tces, direction,
+			attrs);
+
+	if (!ret && (tbl->it_type &
+			(TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE)))
+		pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
+
+	return ret;
+}
+
 static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
 		long npages)
 {
@@ -1177,6 +1192,7 @@ static void pnv_ioda2_tce_free_vm(struct iommu_table *tbl, long index,
 
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 	.set = pnv_ioda2_tce_build_vm,
+	.exchange = pnv_ioda2_tce_xchg_vm,
 	.clear = pnv_ioda2_tce_free_vm,
 	.get = pnv_tce_get,
 };
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 3ab69e2..cf8206b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,6 +628,28 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	return 0;
 }
 
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs)
+{
+	u64 proto_tce = pnv_dmadir_to_flags(direction);
+	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+	long i;
+
+	for (i = 0; i < npages; i++) {
+		unsigned long newtce = proto_tce |
+				((rpn + i) << tbl->it_page_shift);
+		unsigned long idx = index - tbl->it_offset + i;
+		unsigned long oldtce = xchg(pnv_tce(tbl, idx),
+				cpu_to_be64(newtce));
+
+		old_tces[i] = (unsigned long) __va(be64_to_cpu(oldtce));
+	}
+
+	return 0;
+}
+
 void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
 	long i;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 724bce9..6491581 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -220,6 +220,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
 		struct dma_attrs *attrs);
 extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+		long npages, unsigned long uaddr, unsigned long *old_tces,
+		enum dma_data_direction direction,
+		struct dma_attrs *attrs);
 extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
 extern struct iommu_table_ops pnv_ioda1_iommu_ops;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index bcde2ef..8256275 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -234,25 +234,34 @@ static void tce_iommu_release(void *iommu_data)
 	kfree(container);
 }
 
+static void tce_iommu_unuse_page(unsigned long oldtce)
+{
+	struct page *page;
+
+	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+		return;
+
+	page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
+	if (oldtce & TCE_PCI_WRITE)
+		SetPageDirty(page);
+
+	put_page(page);
+}
+
 static int tce_iommu_clear(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages)
 {
+	long ret;
 	unsigned long oldtce;
-	struct page *page;
 
 	for ( ; pages; --pages, ++entry) {
-		oldtce = iommu_clear_tce(tbl, entry);
-		if (!oldtce)
+		oldtce = 0;
+		ret = iommu_tce_xchg(tbl, entry, 0, &oldtce, DMA_NONE);
+		if (ret)
 			continue;
 
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
+		tce_iommu_unuse_page(oldtce);
 	}
 
 	return 0;
@@ -276,7 +285,7 @@ static long tce_iommu_build(struct tce_container *container,
 {
 	long i, ret = 0;
 	struct page *page = NULL;
-	unsigned long hva;
+	unsigned long hva, oldtce;
 	enum dma_data_direction direction = tce_iommu_direction(tce);
 
 	for (i = 0; i < pages; ++i) {
@@ -294,8 +303,9 @@ static long tce_iommu_build(struct tce_container *container,
 
 		hva = (unsigned long) page_address(page) +
 			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
+		oldtce = 0;
 
-		ret = iommu_tce_build(tbl, entry + 1, hva, direction);
+		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
 		if (ret) {
 			put_page(page);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
@@ -303,6 +313,8 @@ static long tce_iommu_build(struct tce_container *container,
 					tce, ret);
 			break;
 		}
+
+		tce_iommu_unuse_page(oldtce);
 		tce += IOMMU_PAGE_SIZE(tbl);
 	}
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 13/24] powerpc/pseries/lpar: Enable VFIO
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

The previous patch introduced the iommu_table_ops::exchange() callback,
which effectively disabled VFIO on pseries. This implements exchange()
for pseries/lpar so VFIO can work in nested guests.

Since the exchange() callback returns the old TCE, it has to call H_GET_TCE
for every TCE put into the table, so VFIO performance in guests
running under PR KVM is expected to be slower than in guests running under
HV KVM or on bare metal hosts.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* added global lock for xchg operations
* added missing be64_to_cpu(oldtce)
---
 arch/powerpc/platforms/pseries/iommu.c | 44 ++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index f537e6e..a903a27 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -137,14 +137,25 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
+				unsigned long *old_tces,
 				enum dma_data_direction direction,
 				struct dma_attrs *attrs)
 {
 	u64 rc = 0;
 	u64 proto_tce, tce;
 	u64 rpn;
-	int ret = 0;
+	int ret = 0, i = 0;
 	long tcenum_start = tcenum, npages_start = npages;
+	static spinlock_t get_tces_lock;
+	static bool get_tces_lock_initialized;
+
+	if (old_tces) {
+		if (!get_tces_lock_initialized) {
+			spin_lock_init(&get_tces_lock);
+			get_tces_lock_initialized = true;
+		}
+		spin_lock(&get_tces_lock);
+	}
 
 	rpn = __pa(uaddr) >> TCE_SHIFT;
 	proto_tce = TCE_PCI_READ;
@@ -153,6 +164,14 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	while (npages--) {
 		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+		if (old_tces) {
+			unsigned long oldtce = 0;
+
+			plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
+					&oldtce);
+			old_tces[i] = be64_to_cpu(oldtce);
+			i++;
+		}
 		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
 		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
@@ -173,13 +192,18 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		tcenum++;
 		rpn++;
 	}
+
+	if (old_tces)
+		spin_unlock(&get_tces_lock);
+
 	return ret;
 }
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
-static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
+				     unsigned long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs)
 {
@@ -194,6 +218,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					   old_tces,
 		                           direction, attrs);
 	}
 
@@ -210,6 +235,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					    old_tces,
 					    direction, attrs);
 		}
 		__this_cpu_write(tce_page, tcep);
@@ -231,6 +257,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		for (l = 0; l < limit; l++) {
 			tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
 			rpn++;
+			if (old_tces)
+				plpar_tce_get((u64)tbl->it_index,
+						(u64)(tcenum + l) << 12,
+						&old_tces[tcenum + l]);
 		}
 
 		rc = plpar_tce_put_indirect((u64)tbl->it_index,
@@ -261,6 +291,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	return ret;
 }
 
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+				     long npages, unsigned long uaddr,
+				     enum dma_data_direction direction,
+				     struct dma_attrs *attrs)
+{
+	return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL,
+			direction, attrs);
+}
+
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
 	u64 rc;
@@ -634,6 +673,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
 	.set = tce_buildmulti_pSeriesLP,
+	.exchange = tce_xchg_pSeriesLP,
 	.clear = tce_freemulti_pSeriesLP,
 	.get = tce_get_pSeriesLP
 };
-- 
2.0.0



* [PATCH v3 14/24] vfio: powerpc/spapr: Register memory
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

The existing implementation accounts the whole DMA window in
the locked_vm counter, which is going to get even worse with multiple
containers and huge DMA windows.

This introduces two ioctls to register/unregister DMA memory. They
receive a userspace address and the size of the memory region which
needs to be pinned/unpinned and counted in locked_vm.

If any memory region has been registered, all subsequent DMA map requests
must address already pinned memory. If no memory has been registered,
then the amount of memory required for a single default DMA window is
accounted when the container is enabled, and every map/unmap pins/unpins
a page.

Dynamic DMA windows and in-kernel acceleration will require memory to
be registered in order to work.

The accounting is done per VFIO container. When support for
multiple groups per container is added, we will have accurate locked_vm
accounting.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 333 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h           |  29 ++++
 2 files changed, 331 insertions(+), 31 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8256275..d0987ae 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -86,8 +86,169 @@ struct tce_container {
 	struct mutex lock;
 	struct iommu_group *grp;
 	bool enabled;
+	struct list_head mem_list;
 };
 
+struct tce_memory {
+	struct list_head next;
+	struct rcu_head rcu;
+	__u64 vaddr;
+	__u64 size;
+	__u64 pfns[];
+};
+
+static void tce_unpin_pages(struct tce_container *container,
+		struct tce_memory *mem, __u64 vaddr, __u64 size)
+{
+	__u64 off;
+	struct page *page = NULL;
+
+
+	for (off = 0; off < size; off += PAGE_SIZE) {
+		if (!mem->pfns[off >> PAGE_SHIFT])
+			continue;
+
+		page = pfn_to_page(mem->pfns[off >> PAGE_SHIFT]);
+		if (!page)
+			continue;
+
+		put_page(page);
+		mem->pfns[off >> PAGE_SHIFT] = 0;
+	}
+}
+
+static void release_tce_memory(struct rcu_head *head)
+{
+	struct tce_memory *mem = container_of(head, struct tce_memory, rcu);
+
+	kfree(mem);
+}
+
+static void tce_do_unregister_pages(struct tce_container *container,
+		struct tce_memory *mem)
+{
+	tce_unpin_pages(container, mem, mem->vaddr, mem->size);
+	decrement_locked_vm(mem->size);
+	list_del_rcu(&mem->next);
+	call_rcu_sched(&mem->rcu, release_tce_memory);
+}
+
+static long tce_unregister_pages(struct tce_container *container,
+		__u64 vaddr, __u64 size)
+{
+	struct tce_memory *mem, *memtmp;
+
+	if (container->enabled)
+		return -EBUSY;
+
+	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
+		return -EINVAL;
+
+	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next) {
+		if ((mem->vaddr == vaddr) && (mem->size == size)) {
+			tce_do_unregister_pages(container, mem);
+			return 0;
+		}
+	}
+
+	return -ENOENT;
+}
+
+static long tce_pin_pages(struct tce_container *container,
+		struct tce_memory *mem, __u64 vaddr, __u64 size)
+{
+	__u64 off;
+	struct page *page = NULL;
+
+	for (off = 0; off < size; off += PAGE_SIZE) {
+		if (1 != get_user_pages_fast(vaddr + off,
+					1/* pages */, 1/* iswrite */, &page)) {
+			tce_unpin_pages(container, mem, vaddr, off);
+			return -EFAULT;
+		}
+
+		mem->pfns[off >> PAGE_SHIFT] = page_to_pfn(page);
+	}
+
+	return 0;
+}
+
+static long tce_register_pages(struct tce_container *container,
+		__u64 vaddr, __u64 size)
+{
+	long ret;
+	struct tce_memory *mem;
+
+	if (container->enabled)
+		return -EBUSY;
+
+	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
+			((vaddr + size) < vaddr))
+		return -EINVAL;
+
+	/* Any overlap with registered chunks? */
+	rcu_read_lock();
+	list_for_each_entry_rcu(mem, &container->mem_list, next) {
+		if ((mem->vaddr < (vaddr + size)) &&
+				(vaddr < (mem->vaddr + mem->size))) {
+			ret = -EBUSY;
+			goto unlock_exit;
+		}
+	}
+
+	ret = try_increment_locked_vm(size >> PAGE_SHIFT);
+	if (ret)
+		goto unlock_exit;
+
+	mem = kzalloc(sizeof(*mem) + (size >> (PAGE_SHIFT - 3)), GFP_KERNEL);
+	if (!mem)
+		goto unlock_exit;
+
+	if (tce_pin_pages(container, mem, vaddr, size))
+		goto free_exit;
+
+	mem->vaddr = vaddr;
+	mem->size = size;
+
+	list_add_rcu(&mem->next, &container->mem_list);
+	rcu_read_unlock();
+
+	return 0;
+
+free_exit:
+	kfree(mem);
+
+unlock_exit:
+	decrement_locked_vm(size >> PAGE_SHIFT);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool tce_preregistered(struct tce_container *container)
+{
+	return !list_empty(&container->mem_list);
+}
+
+static bool tce_pinned(struct tce_container *container,
+		__u64 vaddr, __u64 size)
+{
+	struct tce_memory *mem;
+	bool ret = false;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(mem, &container->mem_list, next) {
+		if ((mem->vaddr <= vaddr) &&
+				(vaddr + size <= mem->vaddr + mem->size)) {
+			ret = true;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static bool tce_check_page_size(struct page *page, unsigned page_shift)
 {
 	unsigned shift;
@@ -166,14 +327,16 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * as this information is only available from KVM and VFIO is
 	 * KVM agnostic.
 	 */
-	iommu = iommu_group_get_iommudata(container->grp);
-	if (!iommu)
-		return -EFAULT;
+	if (!tce_preregistered(container)) {
+		iommu = iommu_group_get_iommudata(container->grp);
+		if (!iommu)
+			return -EFAULT;
 
-	tbl = &iommu->tables[0];
-	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
-	if (ret)
-		return ret;
+		tbl = &iommu->tables[0];
+		ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
+		if (ret)
+			return ret;
+	}
 
 	container->enabled = true;
 
@@ -193,12 +356,14 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!container->grp || !current->mm)
 		return;
 
-	iommu = iommu_group_get_iommudata(container->grp);
-	if (!iommu)
-		return;
+	if (!tce_preregistered(container)) {
+		iommu = iommu_group_get_iommudata(container->grp);
+		if (!iommu)
+			return;
 
-	tbl = &iommu->tables[0];
-	decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
+		tbl = &iommu->tables[0];
+		decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
+	}
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -215,6 +380,7 @@ static void *tce_iommu_open(unsigned long arg)
 		return ERR_PTR(-ENOMEM);
 
 	mutex_init(&container->lock);
+	INIT_LIST_HEAD_RCU(&container->mem_list);
 
 	return container;
 }
@@ -222,6 +388,7 @@ static void *tce_iommu_open(unsigned long arg)
 static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
+	struct tce_memory *mem, *memtmp;
 
 	WARN_ON(container->grp);
 	tce_iommu_disable(container);
@@ -229,14 +396,19 @@ static void tce_iommu_release(void *iommu_data)
 	if (container->grp)
 		tce_iommu_detach_group(iommu_data, container->grp);
 
+	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next)
+		tce_do_unregister_pages(container, mem);
+
 	mutex_destroy(&container->lock);
 
 	kfree(container);
 }
 
-static void tce_iommu_unuse_page(unsigned long oldtce)
+static void tce_iommu_unuse_page(struct tce_container *container,
+		unsigned long oldtce)
 {
 	struct page *page;
+	bool do_put = !tce_preregistered(container);
 
 	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
 		return;
@@ -245,7 +417,8 @@ static void tce_iommu_unuse_page(unsigned long oldtce)
 	if (oldtce & TCE_PCI_WRITE)
 		SetPageDirty(page);
 
-	put_page(page);
+	if (do_put)
+		put_page(page);
 }
 
 static int tce_iommu_clear(struct tce_container *container,
@@ -261,7 +434,7 @@ static int tce_iommu_clear(struct tce_container *container,
 		if (ret)
 			continue;
 
-		tce_iommu_unuse_page(oldtce);
+		tce_iommu_unuse_page(container, oldtce);
 	}
 
 	return 0;
@@ -279,42 +452,91 @@ static enum dma_data_direction tce_iommu_direction(unsigned long tce)
 		return DMA_NONE;
 }
 
+static unsigned long tce_get_hva_cached(struct tce_container *container,
+		unsigned page_shift, unsigned long tce)
+{
+	struct tce_memory *mem;
+	struct page *page = NULL;
+	unsigned long hva = -1;
+
+	tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	rcu_read_lock();
+	list_for_each_entry_rcu(mem, &container->mem_list, next) {
+		if ((mem->vaddr <= tce) && (tce < (mem->vaddr + mem->size))) {
+			unsigned long gfn = (tce - mem->vaddr) >> PAGE_SHIFT;
+			unsigned long hpa = mem->pfns[gfn] << PAGE_SHIFT;
+
+			page = pfn_to_page(mem->pfns[gfn]);
+
+			if (!tce_check_page_size(page, page_shift))
+				break;
+
+			hva = (unsigned long) __va(hpa);
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return hva;
+}
+
+static unsigned long tce_get_hva(struct tce_container *container,
+		unsigned page_shift, unsigned long tce)
+{
+	long ret = 0;
+	struct page *page = NULL;
+	unsigned long hva = -1;
+	enum dma_data_direction direction = tce_iommu_direction(tce);
+
+	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+			direction != DMA_TO_DEVICE, &page);
+	if (unlikely(ret != 1))
+		return -1;
+
+	if (!tce_check_page_size(page, page_shift)) {
+		put_page(page);
+		return -1;
+	}
+
+	hva = (unsigned long) page_address(page) +
+		(tce & ~((1ULL << page_shift) - 1) & ~PAGE_MASK);
+
+	return hva;
+}
+
 static long tce_iommu_build(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long tce, unsigned long pages)
 {
 	long i, ret = 0;
-	struct page *page = NULL;
 	unsigned long hva, oldtce;
 	enum dma_data_direction direction = tce_iommu_direction(tce);
+	bool do_put = false;
 
 	for (i = 0; i < pages; ++i) {
-		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-				direction != DMA_TO_DEVICE, &page);
-		if (unlikely(ret != 1)) {
-			ret = -EFAULT;
-			break;
+		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
+		if (hva == -1) {
+			do_put = true;
+			WARN_ON_ONCE(1);
+			hva = tce_get_hva(container, tbl->it_page_shift, tce);
 		}
 
-		if (!tce_check_page_size(page, tbl->it_page_shift)) {
-			ret = -EFAULT;
-			break;
-		}
-
-		hva = (unsigned long) page_address(page) +
-			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
 		oldtce = 0;
-
 		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
 		if (ret) {
-			put_page(page);
+			if (do_put)
+				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
 					__func__, entry << tbl->it_page_shift,
 					tce, ret);
 			break;
 		}
 
-		tce_iommu_unuse_page(oldtce);
+		if (do_put)
+			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
+
+		tce_iommu_unuse_page(container, oldtce);
+
 		tce += IOMMU_PAGE_SIZE(tbl);
 	}
 
@@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
+		/* If any memory is pinned, only allow pages from that region */
+		if (tce_preregistered(container) &&
+				!tce_pinned(container, param.vaddr, param.size))
+			return -EPERM;
+
 		ret = tce_iommu_build(container, tbl,
 				param.iova >> tbl->it_page_shift,
 				tce, param.size >> tbl->it_page_shift);
@@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 		return ret;
 	}
+	case VFIO_IOMMU_REGISTER_MEMORY: {
+		struct vfio_iommu_type1_register_memory param;
+
+		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		ret = tce_register_pages(container, param.vaddr, param.size);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_UNREGISTER_MEMORY: {
+		struct vfio_iommu_type1_unregister_memory param;
+
+		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		tce_unregister_pages(container, param.vaddr, param.size);
+		mutex_unlock(&container->lock);
+
+		return 0;
+	}
 	case VFIO_IOMMU_ENABLE:
 		mutex_lock(&container->lock);
 		ret = tce_iommu_enable(container);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 29715d2..2bb0c9b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get simpler.
+ */
+struct vfio_iommu_type1_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
+ *
+ * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
+ */
+struct vfio_iommu_type1_unregister_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.0.0



 
+	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next)
+		tce_do_unregister_pages(container, mem);
+
 	mutex_destroy(&container->lock);
 
 	kfree(container);
 }
 
-static void tce_iommu_unuse_page(unsigned long oldtce)
+static void tce_iommu_unuse_page(struct tce_container *container,
+		unsigned long oldtce)
 {
 	struct page *page;
+	bool do_put = !tce_preregistered(container);
 
 	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
 		return;
@@ -245,7 +417,8 @@ static void tce_iommu_unuse_page(unsigned long oldtce)
 	if (oldtce & TCE_PCI_WRITE)
 		SetPageDirty(page);
 
-	put_page(page);
+	if (do_put)
+		put_page(page);
 }
 
 static int tce_iommu_clear(struct tce_container *container,
@@ -261,7 +434,7 @@ static int tce_iommu_clear(struct tce_container *container,
 		if (ret)
 			continue;
 
-		tce_iommu_unuse_page(oldtce);
+		tce_iommu_unuse_page(container, oldtce);
 	}
 
 	return 0;
@@ -279,42 +452,91 @@ static enum dma_data_direction tce_iommu_direction(unsigned long tce)
 		return DMA_NONE;
 }
 
+static unsigned long tce_get_hva_cached(struct tce_container *container,
+		unsigned page_shift, unsigned long tce)
+{
+	struct tce_memory *mem;
+	struct page *page = NULL;
+	unsigned long hva = -1;
+
+	tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+	rcu_read_lock();
+	list_for_each_entry_rcu(mem, &container->mem_list, next) {
+		if ((mem->vaddr <= tce) && (tce < (mem->vaddr + mem->size))) {
+			unsigned long gfn = (tce - mem->vaddr) >> PAGE_SHIFT;
+			unsigned long hpa = mem->pfns[gfn] << PAGE_SHIFT;
+
+			page = pfn_to_page(mem->pfns[gfn]);
+
+			if (!tce_check_page_size(page, page_shift))
+				break;
+
+			hva = (unsigned long) __va(hpa);
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return hva;
+}
+
+static unsigned long tce_get_hva(struct tce_container *container,
+		unsigned page_shift, unsigned long tce)
+{
+	long ret = 0;
+	struct page *page = NULL;
+	unsigned long hva = -1;
+	enum dma_data_direction direction = tce_iommu_direction(tce);
+
+	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+			direction != DMA_TO_DEVICE, &page);
+	if (unlikely(ret != 1))
+		return -1;
+
+	if (!tce_check_page_size(page, page_shift)) {
+		put_page(page);
+		return -1;
+	}
+
+	hva = (unsigned long) page_address(page) +
+		(tce & ~((1ULL << page_shift) - 1) & ~PAGE_MASK);
+
+	return hva;
+}
+
 static long tce_iommu_build(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long tce, unsigned long pages)
 {
 	long i, ret = 0;
-	struct page *page = NULL;
 	unsigned long hva, oldtce;
 	enum dma_data_direction direction = tce_iommu_direction(tce);
+	bool do_put = false;
 
 	for (i = 0; i < pages; ++i) {
-		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-				direction != DMA_TO_DEVICE, &page);
-		if (unlikely(ret != 1)) {
-			ret = -EFAULT;
-			break;
+		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
+		if (hva == -1) {
+			do_put = true;
+			WARN_ON_ONCE(1);
+			hva = tce_get_hva(container, tbl->it_page_shift, tce);
 		}
 
-		if (!tce_check_page_size(page, tbl->it_page_shift)) {
-			ret = -EFAULT;
-			break;
-		}
-
-		hva = (unsigned long) page_address(page) +
-			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
 		oldtce = 0;
-
 		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
 		if (ret) {
-			put_page(page);
+			if (do_put)
+				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
 					__func__, entry << tbl->it_page_shift,
 					tce, ret);
 			break;
 		}
 
-		tce_iommu_unuse_page(oldtce);
+		if (do_put)
+			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
+
+		tce_iommu_unuse_page(container, oldtce);
+
 		tce += IOMMU_PAGE_SIZE(tbl);
 	}
 
@@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
+		/* If any memory is pinned, only allow pages from that region */
+		if (tce_preregistered(container) &&
+				!tce_pinned(container, param.vaddr, param.size))
+			return -EPERM;
+
 		ret = tce_iommu_build(container, tbl,
 				param.iova >> tbl->it_page_shift,
 				tce, param.size >> tbl->it_page_shift);
@@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 		return ret;
 	}
+	case VFIO_IOMMU_REGISTER_MEMORY: {
+		struct vfio_iommu_type1_register_memory param;
+
+		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		ret = tce_register_pages(container, param.vaddr, param.size);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_UNREGISTER_MEMORY: {
+		struct vfio_iommu_type1_unregister_memory param;
+
+		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
+				size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		mutex_lock(&container->lock);
+		tce_unregister_pages(container, param.vaddr, param.size);
+		mutex_unlock(&container->lock);
+
+		return 0;
+	}
 	case VFIO_IOMMU_ENABLE:
 		mutex_lock(&container->lock);
 		ret = tce_iommu_enable(container);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 29715d2..2bb0c9b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get simpler.
+ */
+struct vfio_iommu_type1_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
+ *
+ * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
+ */
+struct vfio_iommu_type1_unregister_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 15/24] powerpc/powernv/ioda2: Rework iommu_table creation
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This moves iommu_table creation and initialization to the beginning of
pnv_pci_ioda2_setup_dma_pe(). This is a mechanical patch.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6d279d5..ebfea0a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1393,27 +1393,31 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	addr = page_address(tce_mem);
 	memset(addr, 0, tce_table_size);
 
+	/* Setup iommu */
+	pe->iommu.tables[0].it_iommu = &pe->iommu;
+
+	/* Setup linux iommu table */
+	tbl = &pe->iommu.tables[0];
+	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+			IOMMU_PAGE_SHIFT_4K);
+
+	tbl->it_ops = &pnv_ioda2_iommu_ops;
+	iommu_init_table(tbl, phb->hose->node);
+	pe->iommu.ops = &pnv_pci_ioda2_ops;
+
 	/*
 	 * Map TCE table through TVT. The TVE index is the PE number
 	 * shifted by 1 bit for 32-bits DMA space.
 	 */
 	rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-					pe->pe_number << 1, 1, __pa(addr),
-					tce_table_size, 0x1000);
+			pe->pe_number << 1, 1, __pa(tbl->it_base),
+			tbl->it_size << 3, 1ULL << tbl->it_page_shift);
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
 		goto fail;
 	}
 
-	/* Setup iommu */
-	pe->iommu.tables[0].it_iommu = &pe->iommu;
-
-	/* Setup linux iommu table */
-	tbl = &pe->iommu.tables[0];
-	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
-			IOMMU_PAGE_SHIFT_4K);
-
 	/* OPAL variant of PHB3 invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
 	if (swinvp) {
@@ -1427,14 +1431,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
-	tbl->it_ops = &pnv_ioda2_iommu_ops;
-	iommu_init_table(tbl, phb->hose->node);
-	pe->iommu.ops = &pnv_pci_ioda2_ops;
+
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
 			pe->pe_number);
 
 	if (pe->pdev)
-		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
+		set_iommu_table_base_and_group(&pe->pdev->dev,
+				&pe->iommu.tables[0]);
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* [PATCH v3 16/24] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_create_table
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per VFIO container.

This is a mechanical patch.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 88 +++++++++++++++++++++++--------
 1 file changed, 65 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ebfea0a..95d9119 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1295,6 +1295,62 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
+static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table *tbl)
+{
+	int nid = pe->phb->hose->node;
+	struct page *tce_mem = NULL;
+	void *addr;
+	unsigned long tce_table_size;
+	int64_t rc;
+	unsigned order;
+
+	if ((page_shift != 12) && (page_shift != 16) && (page_shift != 24))
+		return -EINVAL;
+
+	if ((1ULL << window_shift) > memory_hotplug_max())
+		return -EINVAL;
+
+	tce_table_size = (1ULL << (window_shift - page_shift)) * 8;
+	tce_table_size = max(0x1000UL, tce_table_size);
+
+	/* Allocate TCE table */
+	order = get_order(tce_table_size);
+
+	tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+	if (!tce_mem) {
+		pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+		rc = -ENOMEM;
+		goto fail;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, tce_table_size);
+
+	/* Setup linux iommu table */
+	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+			page_shift);
+
+	tbl->it_ops = &pnv_ioda2_iommu_ops;
+	iommu_init_table(tbl, nid);
+
+	return 0;
+fail:
+	if (tce_mem)
+		__free_pages(tce_mem, get_order(tce_table_size));
+
+	return rc;
+}
+
+static void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+	if (!tbl->it_size)
+		return;
+
+	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+	memset(tbl, 0, sizeof(struct iommu_table));
+}
+
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -1365,11 +1421,9 @@ static struct powerpc_iommu_ops pnv_pci_ioda2_ops = {
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
-	struct page *tce_mem = NULL;
-	void *addr;
 	const __be64 *swinvp;
-	struct iommu_table *tbl;
-	unsigned int tce_table_size, end;
+	unsigned int end;
+	struct iommu_table *tbl = &pe->iommu.tables[0];
 	int64_t rc;
 
 	/* We shouldn't already have a 32-bit DMA associated */
@@ -1378,31 +1432,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 
 	/* The PE will reserve all possible 32-bits space */
 	pe->tce32_seg = 0;
+
 	end = (1 << ilog2(phb->ioda.m32_pci_base));
-	tce_table_size = (end / 0x1000) * 8;
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		end);
 
-	/* Allocate TCE table */
-	tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL,
-				   get_order(tce_table_size));
-	if (!tce_mem) {
-		pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
-		goto fail;
+	rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
+			ilog2(phb->ioda.m32_pci_base), tbl);
+	if (rc) {
+		pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
+		return;
 	}
-	addr = page_address(tce_mem);
-	memset(addr, 0, tce_table_size);
 
 	/* Setup iommu */
 	pe->iommu.tables[0].it_iommu = &pe->iommu;
-
-	/* Setup linux iommu table */
-	tbl = &pe->iommu.tables[0];
-	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
-			IOMMU_PAGE_SHIFT_4K);
-
-	tbl->it_ops = &pnv_ioda2_iommu_ops;
-	iommu_init_table(tbl, phb->hose->node);
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 
 	/*
@@ -1447,8 +1490,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 fail:
 	if (pe->tce32_seg >= 0)
 		pe->tce32_seg = -1;
-	if (tce_mem)
-		__free_pages(tce_mem, get_order(tce_table_size));
+	pnv_pci_ioda2_free_table(tbl);
 }
 
 static void pnv_ioda_setup_dma(struct pnv_phb *phb)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* [PATCH v3 17/24] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This is a part of moving DMA window programming to an iommu_ops
callback.

This is a mechanical patch.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 ++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 95d9119..1f725d4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1351,6 +1351,57 @@ static void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
 	memset(tbl, 0, sizeof(struct iommu_table));
 }
 
+static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
+		struct iommu_table *tbl)
+{
+	struct pnv_phb *phb = pe->phb;
+	const __be64 *swinvp;
+	int64_t rc;
+	const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+	const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+	pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx\n",
+			start_addr, start_addr + win_size - 1,
+			1UL << tbl->it_page_shift, tbl->it_size << 3);
+
+	pe->iommu.tables[0] = *tbl;
+	tbl = &pe->iommu.tables[0];
+	tbl->it_iommu = &pe->iommu;
+
+	/*
+	 * Map TCE table through TVT. The TVE index is the PE number
+	 * shifted by 1 bit for 32-bits DMA space.
+	 */
+	rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+			pe->pe_number << 1, 1, __pa(tbl->it_base),
+			tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+	if (rc) {
+		pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
+		goto fail;
+	}
+
+	/* OPAL variant of PHB3 invalidated TCEs */
+	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
+	if (swinvp) {
+		/* We need a couple more fields -- an address and a data
+		 * to or.  Since the bus is only printed out on table free
+		 * errors, and on the first pass the data will be a relative
+		 * bus number, print that out instead.
+		 */
+		pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
+		tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
+				8);
+		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+	}
+
+	return 0;
+fail:
+	if (pe->tce32_seg >= 0)
+		pe->tce32_seg = -1;
+
+	return rc;
+}
+
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -1421,7 +1472,6 @@ static struct powerpc_iommu_ops pnv_pci_ioda2_ops = {
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
-	const __be64 *swinvp;
 	unsigned int end;
 	struct iommu_table *tbl = &pe->iommu.tables[0];
 	int64_t rc;
@@ -1448,31 +1498,14 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	pe->iommu.tables[0].it_iommu = &pe->iommu;
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 
-	/*
-	 * Map TCE table through TVT. The TVE index is the PE number
-	 * shifted by 1 bit for 32-bits DMA space.
-	 */
-	rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-			pe->pe_number << 1, 1, __pa(tbl->it_base),
-			tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+	rc = pnv_pci_ioda2_set_window(pe, tbl);
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
-		goto fail;
-	}
-
-	/* OPAL variant of PHB3 invalidated TCEs */
-	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
-	if (swinvp) {
-		/* We need a couple more fields -- an address and a data
-		 * to or.  Since the bus is only printed out on table free
-		 * errors, and on the first pass the data will be a relative
-		 * bus number, print that out instead.
-		 */
-		pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
-		tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
-				8);
-		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+		pnv_pci_ioda2_free_table(tbl);
+		if (pe->tce32_seg >= 0)
+			pe->tce32_seg = -1;
+		return;
 	}
 
 	iommu_register_group(&pe->iommu, phb->hose->global_number,
@@ -1486,11 +1519,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 
 	/* Also create a bypass window */
 	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
-	return;
-fail:
-	if (pe->tce32_seg >= 0)
-		pe->tce32_seg = -1;
-	pnv_pci_ioda2_free_table(tbl);
 }
 
 static void pnv_ioda_setup_dma(struct pnv_phb *phb)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* [PATCH v3 18/24] powerpc/iommu: Split iommu_free_table into 2 helpers
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:21   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

The iommu_free_table helper releases the memory the table uses (the TCE
table and @it_map) and frees the iommu_table struct as well. We might not
want that very last step as we store iommu_table in parent structures.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c      | 57 ++++++++++++++++++++++++----------------
 2 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index bf26d47..cc26eca 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -122,6 +122,7 @@ static inline void *get_iommu_table_base(struct device *dev)
 
 extern struct iommu_table *iommu_table_alloc(int node);
 /* Frees table for an individual device node */
+extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
 extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
 
 /* Initializes an iommu_table based in values set in the passed-in
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 26feaff..5f87076 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node)
 	return &iommu->tables[0];
 }
 
+void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
+{
+	if (!tbl)
+		return;
+
+	if (tbl->it_map) {
+		unsigned long bitmap_sz;
+		unsigned int order;
+
+		/*
+		 * In case we have reserved the first bit, we should not emit
+		 * the warning below.
+		 */
+		if (tbl->it_offset == 0)
+			clear_bit(0, tbl->it_map);
+
+		/* verify that table contains no entries */
+		if (!bitmap_empty(tbl->it_map, tbl->it_size))
+			pr_warn("%s: Unexpected TCEs for %s\n", __func__,
+					node_name);
+
+		/* calculate bitmap size in bytes */
+		bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+
+		/* free bitmap */
+		order = get_order(bitmap_sz);
+		free_pages((unsigned long) tbl->it_map, order);
+	}
+
+	memset(tbl, 0, sizeof(*tbl));
+}
+
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
-	unsigned long bitmap_sz;
-	unsigned int order;
 	struct powerpc_iommu *iommu = tbl->it_iommu;
 
-	if (!tbl || !tbl->it_map) {
-		printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
-				node_name);
+	if (!tbl)
 		return;
-	}
 
-	/*
-	 * In case we have reserved the first bit, we should not emit
-	 * the warning below.
-	 */
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	iommu_reset_table(tbl, node_name);
 
 #ifdef CONFIG_IOMMU_API
 	if (iommu->group) {
@@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	}
 #endif
 
-	/* verify that table contains no entries */
-	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
-
-	/* calculate bitmap size in bytes */
-	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
-
-	/* free bitmap */
-	order = get_order(bitmap_sz);
-	free_pages((unsigned long) tbl->it_map, order);
-
 	/* free table */
 	kfree(iommu);
 }
-- 
2.0.0



* [PATCH v3 19/24] powerpc/powernv: Implement multilevel TCE tables
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This adds multi-level TCE table support to the pnv_pci_ioda2_create_table()
and pnv_pci_ioda2_free_table() callbacks.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          |   4 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 125 +++++++++++++++++++++++-------
 arch/powerpc/platforms/powernv/pci.c      |  19 +++++
 3 files changed, 122 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cc26eca..283f70f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -85,6 +85,8 @@ struct iommu_pool {
 struct iommu_table {
 	unsigned long  it_busno;     /* Bus number this table belongs to */
 	unsigned long  it_size;      /* Size of iommu table in entries */
+	unsigned long  it_indirect_levels;
+	unsigned long  it_level_size;
 	unsigned long  it_offset;    /* Offset into global table */
 	unsigned long  it_base;      /* mapped address of tce table */
 	unsigned long  it_index;     /* which iommu table this is */
@@ -133,6 +135,8 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 
 #define POWERPC_IOMMU_MAX_TABLES	1
 
+#define POWERPC_IOMMU_DEFAULT_LEVELS	1
+
 struct powerpc_iommu;
 
 struct powerpc_iommu_ops {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1f725d4..f542819 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1295,16 +1295,79 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
+static void pnv_free_tce_table(unsigned long addr, unsigned size,
+		unsigned level)
+{
+	addr &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+	if (level) {
+		long i;
+		u64 *tmp = (u64 *) addr;
+
+		for (i = 0; i < size; ++i) {
+			unsigned long hpa = be64_to_cpu(tmp[i]);
+
+			if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
+				continue;
+
+			pnv_free_tce_table((unsigned long) __va(hpa),
+					size, level - 1);
+		}
+	}
+
+	free_pages(addr, get_order(size << 3));
+}
+
+static __be64 *pnv_alloc_tce_table(int nid,
+		unsigned shift, unsigned levels, unsigned long *left)
+{
+	struct page *tce_mem = NULL;
+	__be64 *addr, *tmp;
+	unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
+	unsigned long chunk = 1UL << shift, i;
+
+	tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+	if (!tce_mem) {
+		pr_err("Failed to allocate a TCE memory\n");
+		return NULL;
+	}
+
+	if (!*left)
+		return NULL;
+
+	addr = page_address(tce_mem);
+	memset(addr, 0, chunk);
+
+	--levels;
+	if (!levels) {
+		/* This is last level, actual TCEs */
+		*left -= min(*left, chunk);
+		return addr;
+	}
+
+	for (i = 0; i < (chunk >> 3); ++i) {
+		/* We allocated required TCEs, mark the rest "page fault" */
+		if (!*left) {
+			addr[i] = cpu_to_be64(0);
+			continue;
+		}
+
+		tmp = pnv_alloc_tce_table(nid, shift, levels, left);
+		addr[i] = cpu_to_be64(__pa(tmp) |
+				TCE_PCI_READ | TCE_PCI_WRITE);
+	}
+
+	return addr;
+}
+
 static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
-		__u32 page_shift, __u32 window_shift,
+		__u32 page_shift, __u32 window_shift, __u32 levels,
 		struct iommu_table *tbl)
 {
 	int nid = pe->phb->hose->node;
-	struct page *tce_mem = NULL;
 	void *addr;
-	unsigned long tce_table_size;
-	int64_t rc;
-	unsigned order;
+	unsigned long tce_table_size, left;
+	unsigned shift;
 
 	if ((page_shift != 12) && (page_shift != 16) && (page_shift != 24))
 		return -EINVAL;
@@ -1312,20 +1375,27 @@ static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
 	if ((1ULL << window_shift) > memory_hotplug_max())
 		return -EINVAL;
 
+	if (!levels || (levels > 5))
+		return -EINVAL;
+
 	tce_table_size = (1ULL << (window_shift - page_shift)) * 8;
 	tce_table_size = max(0x1000UL, tce_table_size);
 
 	/* Allocate TCE table */
-	order = get_order(tce_table_size);
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+	shift = ROUND_UP(window_shift - page_shift, levels) / levels;
+	shift += 3;
+	shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
+	pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
+			1ULL << window_shift, levels, 1UL << shift);
 
-	tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
-	if (!tce_mem) {
-		pr_err("Failed to allocate a TCE memory, order=%d\n", order);
-		rc = -ENOMEM;
-		goto fail;
-	}
-	addr = page_address(tce_mem);
-	memset(addr, 0, tce_table_size);
+	tbl->it_level_size = 1ULL << (shift - 3);
+	left = tce_table_size;
+	addr = pnv_alloc_tce_table(nid, shift, levels, &left);
+	if (!addr)
+		return -ENOMEM;
+
+	tbl->it_indirect_levels = levels - 1;
 
 	/* Setup linux iommu table */
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
@@ -1335,20 +1405,18 @@ static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
 	iommu_init_table(tbl, nid);
 
 	return 0;
-fail:
-	if (tce_mem)
-		__free_pages(tce_mem, get_order(tce_table_size));
-
-	return rc;
 }
 
 static void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
 {
+	const unsigned size = tbl->it_indirect_levels ?
+			tbl->it_level_size : tbl->it_size;
+
 	if (!tbl->it_size)
 		return;
 
-	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
-	memset(tbl, 0, sizeof(struct iommu_table));
+	pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels);
+	iommu_reset_table(tbl, "ioda2");
 }
 
 static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
@@ -1357,12 +1425,15 @@ static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
 	struct pnv_phb *phb = pe->phb;
 	const __be64 *swinvp;
 	int64_t rc;
+	const unsigned size = tbl->it_indirect_levels ?
+			tbl->it_level_size : tbl->it_size;
 	const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
 	const __u64 win_size = tbl->it_size << tbl->it_page_shift;
 
-	pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx\n",
+	pe_info(pe, "Setting up window at %llx..%llx pagesize=0x%x tablesize=0x%lx levels=%d levelsize=%x\n",
 			start_addr, start_addr + win_size - 1,
-			1UL << tbl->it_page_shift, tbl->it_size << 3);
+			1UL << tbl->it_page_shift, tbl->it_size,
+			tbl->it_indirect_levels + 1, tbl->it_level_size);
 
 	pe->iommu.tables[0] = *tbl;
 	tbl = &pe->iommu.tables[0];
@@ -1373,8 +1444,9 @@ static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
 	 * shifted by 1 bit for 32-bits DMA space.
 	 */
 	rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-			pe->pe_number << 1, 1, __pa(tbl->it_base),
-			tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+			pe->pe_number << 1, tbl->it_indirect_levels + 1,
+			__pa(tbl->it_base),
+			size << 3, 1ULL << tbl->it_page_shift);
 	if (rc) {
 		pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
 		goto fail;
@@ -1488,7 +1560,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		end);
 
 	rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
-			ilog2(phb->ioda.m32_pci_base), tbl);
+			ilog2(phb->ioda.m32_pci_base),
+			POWERPC_IOMMU_DEFAULT_LEVELS, tbl);
 	if (rc) {
 		pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc);
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index cf8206b..e98495a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -605,6 +605,25 @@ static unsigned long pnv_dmadir_to_flags(enum dma_data_direction direction)
 static __be64 *pnv_tce(struct iommu_table *tbl, long index)
 {
 	__be64 *tmp = ((__be64 *)tbl->it_base);
+	int  level = tbl->it_indirect_levels;
+	const long shift = ilog2(tbl->it_level_size);
+	unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
+
+	if (index >= tbl->it_size)
+		return NULL;
+
+	while (level) {
+		int n = (index & mask) >> (level * shift);
+		unsigned long tce = be64_to_cpu(tmp[n]);
+
+		if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+			return NULL;
+
+		tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
+		index &= ~mask;
+		mask >>= shift;
+		--level;
+	}
 
 	return tmp + index;
 }
-- 
2.0.0



* [PATCH v3 20/24] powerpc/powernv: Change prototypes to receive iommu
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This changes a few functions to receive a powerpc_iommu pointer
rather than a PE as they are going to become part of the upcoming
powerpc_iommu_ops callback set.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f542819..29bd7a4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1360,10 +1360,12 @@ static __be64 *pnv_alloc_tce_table(int nid,
 	return addr;
 }
 
-static long pnv_pci_ioda2_create_table(struct pnv_ioda_pe *pe,
+static long pnv_pci_ioda2_create_table(struct powerpc_iommu *iommu,
 		__u32 page_shift, __u32 window_shift, __u32 levels,
 		struct iommu_table *tbl)
 {
+	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
+						iommu);
 	int nid = pe->phb->hose->node;
 	void *addr;
 	unsigned long tce_table_size, left;
@@ -1419,9 +1421,11 @@ static void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
 	iommu_reset_table(tbl, "ioda2");
 }
 
-static long pnv_pci_ioda2_set_window(struct pnv_ioda_pe *pe,
+static long pnv_pci_ioda2_set_window(struct powerpc_iommu *iommu,
 		struct iommu_table *tbl)
 {
+	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
+						iommu);
 	struct pnv_phb *phb = pe->phb;
 	const __be64 *swinvp;
 	int64_t rc;
@@ -1554,12 +1558,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 
 	/* The PE will reserve all possible 32-bits space */
 	pe->tce32_seg = 0;
-
 	end = (1 << ilog2(phb->ioda.m32_pci_base));
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		end);
 
-	rc = pnv_pci_ioda2_create_table(pe, IOMMU_PAGE_SHIFT_4K,
+	rc = pnv_pci_ioda2_create_table(&pe->iommu, IOMMU_PAGE_SHIFT_4K,
 			ilog2(phb->ioda.m32_pci_base),
 			POWERPC_IOMMU_DEFAULT_LEVELS, tbl);
 	if (rc) {
@@ -1571,7 +1574,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	pe->iommu.tables[0].it_iommu = &pe->iommu;
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 
-	rc = pnv_pci_ioda2_set_window(pe, tbl);
+	rc = pnv_pci_ioda2_set_window(&pe->iommu, tbl);
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
-- 
2.0.0



* [PATCH v3 21/24] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This extends powerpc_iommu_ops by a set of callbacks to support dynamic
DMA windows management.

query() returns IOMMU capabilities such as the default DMA window address,
the number of supported DMA windows and the supported TCE table levels.

create_table() creates a TCE table with the given parameters. For now it
receives a powerpc_iommu only to learn the node ID so that the TCE table
memory can be allocated close to the PHB. The exact format of the allocated
multi-level table might also be specific to the PHB model (although that is
not the case now).

set_window() sets the window at the specified TVT index on the PHB.

unset_window() removes the window from the specified TVT entry.

free_table() frees the memory occupied by a table.

The purpose of this separation is that we need to be able to create
one table and assign it to a set of PHBs. This allows supporting multiple
IOMMU groups in one VFIO container and brings VFIO on SPAPR closer
to the way it works on x86.

This uses new helpers to remove the default TCE table when ownership is
taken and to recreate it when ownership is released. So once an external
user (such as VFIO) has obtained ownership of a group, the group has no
DMA windows at all, neither the default 32-bit window nor the bypass
window. The external user is expected to unprogram any DMA windows it has
created on the PHBs before returning ownership to the kernel.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h          | 31 ++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 98 ++++++++++++++++++++++++++-----
 2 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 283f70f..8393822 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -147,12 +147,43 @@ struct powerpc_iommu_ops {
 	 */
 	void (*set_ownership)(struct powerpc_iommu *iommu,
 			bool enable);
+
+	long (*create_table)(struct powerpc_iommu *iommu,
+			int num,
+			__u32 page_shift,
+			__u32 window_shift,
+			__u32 levels,
+			struct iommu_table *tbl);
+	long (*set_window)(struct powerpc_iommu *iommu,
+			int num,
+			struct iommu_table *tblnew);
+	long (*unset_window)(struct powerpc_iommu *iommu,
+			int num);
+	void (*free_table)(struct iommu_table *tbl);
 };
 
+/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K           0x01
+#define DDW_PGSIZE_64K          0x02
+#define DDW_PGSIZE_16M          0x04
+#define DDW_PGSIZE_32M          0x08
+#define DDW_PGSIZE_64M          0x10
+#define DDW_PGSIZE_128M         0x20
+#define DDW_PGSIZE_256M         0x40
+#define DDW_PGSIZE_16G          0x80
+#define DDW_PGSIZE_MASK         0xFF
+
 struct powerpc_iommu {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *group;
 #endif
+	/* Some key properties of IOMMU */
+	__u32 tce32_start;
+	__u32 tce32_size;
+	__u32 windows_supported;
+	__u32 levels;
+	__u32 flags;
+
 	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
 	struct powerpc_iommu_ops *ops;
 };
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 29bd7a4..cf63ebb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1360,7 +1360,7 @@ static __be64 *pnv_alloc_tce_table(int nid,
 	return addr;
 }
 
-static long pnv_pci_ioda2_create_table(struct powerpc_iommu *iommu,
+static long pnv_pci_ioda2_create_table(struct powerpc_iommu *iommu, int num,
 		__u32 page_shift, __u32 window_shift, __u32 levels,
 		struct iommu_table *tbl)
 {
@@ -1388,8 +1388,8 @@ static long pnv_pci_ioda2_create_table(struct powerpc_iommu *iommu,
 	shift = ROUND_UP(window_shift - page_shift, levels) / levels;
 	shift += 3;
 	shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
-	pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
-			1ULL << window_shift, levels, 1UL << shift);
+	pr_info("Creating TCE table #%d %08llx, %d levels, TCE table size = %lx\n",
+			num, 1ULL << window_shift, levels, 1UL << shift);
 
 	tbl->it_level_size = 1ULL << (shift - 3);
 	left = tce_table_size;
@@ -1400,11 +1400,10 @@ static long pnv_pci_ioda2_create_table(struct powerpc_iommu *iommu,
 	tbl->it_indirect_levels = levels - 1;
 
 	/* Setup linux iommu table */
-	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
-			page_shift);
+	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size,
+			num ? pe->tce_bypass_base : 0, page_shift);
 
 	tbl->it_ops = &pnv_ioda2_iommu_ops;
-	iommu_init_table(tbl, nid);
 
 	return 0;
 }
@@ -1421,8 +1420,18 @@ static void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
 	iommu_reset_table(tbl, "ioda2");
 }
 
+static inline void pnv_pci_ioda2_tvt_invalidate(unsigned int pe_number,
+		unsigned long it_index)
+{
+	__be64 __iomem *invalidate = (__be64 __iomem *)it_index;
+	/* 01xb - invalidate TCEs that match the specified PE# */
+	unsigned long addr = (0x4ull << 60) | (pe_number & 0xFF);
+
+	__raw_writeq(cpu_to_be64(addr), invalidate);
+}
+
 static long pnv_pci_ioda2_set_window(struct powerpc_iommu *iommu,
-		struct iommu_table *tbl)
+		int num, struct iommu_table *tbl)
 {
 	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
 						iommu);
@@ -1439,8 +1448,8 @@ static long pnv_pci_ioda2_set_window(struct powerpc_iommu *iommu,
 			1UL << tbl->it_page_shift, tbl->it_size,
 			tbl->it_indirect_levels + 1, tbl->it_level_size);
 
-	pe->iommu.tables[0] = *tbl;
-	tbl = &pe->iommu.tables[0];
+	pe->iommu.tables[num] = *tbl;
+	tbl = &pe->iommu.tables[num];
 	tbl->it_iommu = &pe->iommu;
 
 	/*
@@ -1448,7 +1457,8 @@ static long pnv_pci_ioda2_set_window(struct powerpc_iommu *iommu,
 	 * shifted by 1 bit for 32-bits DMA space.
 	 */
 	rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-			pe->pe_number << 1, tbl->it_indirect_levels + 1,
+			(pe->pe_number << 1) + num,
+			tbl->it_indirect_levels + 1,
 			__pa(tbl->it_base),
 			size << 3, 1ULL << tbl->it_page_shift);
 	if (rc) {
@@ -1470,6 +1480,8 @@ static long pnv_pci_ioda2_set_window(struct powerpc_iommu *iommu,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 
+	pnv_pci_ioda2_tvt_invalidate(pe->pe_number, tbl->it_index);
+
 	return 0;
 fail:
 	if (pe->tce32_seg >= 0)
@@ -1478,6 +1490,28 @@ fail:
 	return rc;
 }
 
+static long pnv_pci_ioda2_unset_window(struct powerpc_iommu *iommu, int num)
+{
+	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
+						iommu);
+	struct pnv_phb *phb = pe->phb;
+	long ret;
+
+	pe_info(pe, "Removing DMA window\n");
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+			(pe->pe_number << 1) + num,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret)
+		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+
+	pnv_pci_ioda2_tvt_invalidate(pe->pe_number,
+			iommu->tables[num].it_index);
+
+	return ret;
+}
+
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -1533,16 +1567,42 @@ static void pnv_ioda2_set_ownership(struct powerpc_iommu *iommu,
 {
 	struct pnv_ioda_pe *pe = container_of(iommu, struct pnv_ioda_pe,
 						iommu);
-	if (enable)
-		iommu_take_ownership(iommu);
-	else
-		iommu_release_ownership(iommu);
+	if (enable) {
+		pnv_pci_ioda2_unset_window(&pe->iommu, 0);
+		pnv_pci_ioda2_free_table(&pe->iommu.tables[0]);
+	} else {
+		struct iommu_table *tbl = &pe->iommu.tables[0];
+		int64_t rc;
 
+		rc = pnv_pci_ioda2_create_table(&pe->iommu, 0,
+				IOMMU_PAGE_SHIFT_4K,
+				ilog2(pe->phb->ioda.m32_pci_base),
+				POWERPC_IOMMU_DEFAULT_LEVELS, tbl);
+		if (rc) {
+			pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
+					rc);
+			return;
+		}
+
+		iommu_init_table(tbl, pe->phb->hose->node);
+
+		rc = pnv_pci_ioda2_set_window(&pe->iommu, 0, tbl);
+		if (rc) {
+			pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
+					rc);
+			pnv_pci_ioda2_free_table(tbl);
+			return;
+		}
+	}
 	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
 static struct powerpc_iommu_ops pnv_pci_ioda2_ops = {
 	.set_ownership = pnv_ioda2_set_ownership,
+	.create_table = pnv_pci_ioda2_create_table,
+	.set_window = pnv_pci_ioda2_set_window,
+	.unset_window = pnv_pci_ioda2_unset_window,
+	.free_table = pnv_pci_ioda2_free_table
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1562,7 +1622,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
 		end);
 
-	rc = pnv_pci_ioda2_create_table(&pe->iommu, IOMMU_PAGE_SHIFT_4K,
+	rc = pnv_pci_ioda2_create_table(&pe->iommu, 0, IOMMU_PAGE_SHIFT_4K,
 			ilog2(phb->ioda.m32_pci_base),
 			POWERPC_IOMMU_DEFAULT_LEVELS, tbl);
 	if (rc) {
@@ -1571,10 +1631,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup iommu */
+	pe->iommu.tce32_start = 0;
+	pe->iommu.tce32_size = phb->ioda.m32_pci_base;
+	pe->iommu.windows_supported = POWERPC_IOMMU_MAX_TABLES;
+	pe->iommu.levels = 5;
+	pe->iommu.flags = DDW_PGSIZE_4K | DDW_PGSIZE_64K | DDW_PGSIZE_16M;
+	iommu_init_table(tbl, pe->phb->hose->node);
 	pe->iommu.tables[0].it_iommu = &pe->iommu;
 	pe->iommu.ops = &pnv_pci_ioda2_ops;
 
-	rc = pnv_pci_ioda2_set_window(&pe->iommu, tbl);
+	rc = pnv_pci_ioda2_set_window(&pe->iommu, 0, tbl);
 	if (rc) {
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
-- 
2.0.0




* [PATCH v3 22/24] powerpc/iommu: Get rid of ownership helpers
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

iommu_take_ownership()/iommu_release_ownership() were used to mark bits
in iommu_table::it_map. Since the IOMMU tables are now recreated for VFIO,
it_map is always NULL, so these helpers can be removed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h |  2 -
 arch/powerpc/kernel/iommu.c      | 96 ----------------------------------------
 2 files changed, 98 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 8393822..33009f9 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -272,8 +272,6 @@ extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		enum dma_data_direction direction);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
-extern int iommu_take_ownership(struct powerpc_iommu *iommu);
-extern void iommu_release_ownership(struct powerpc_iommu *iommu);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5f87076..6987115 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1007,102 +1007,6 @@ long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_xchg);
 
-static int iommu_table_take_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-	int ret = 0;
-
-	/*
-	 * VFIO does not control TCE entries allocation and the guest
-	 * can write new TCEs on top of existing ones so iommu_tce_build()
-	 * must be able to release old pages. This functionality
-	 * requires exchange() callback defined so if it is not
-	 * implemented, we disallow taking ownership over the table.
-	 */
-	if (!tbl->it_ops->exchange)
-		return -EINVAL;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock(&tbl->pools[i].lock);
-
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
-
-	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
-		pr_err("iommu_tce: it_map is not empty");
-		ret = -EBUSY;
-		if (tbl->it_offset == 0)
-			set_bit(0, tbl->it_map);
-	} else {
-		memset(tbl->it_map, 0xff, sz);
-	}
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-
-	return 0;
-}
-
-static void iommu_table_release_ownership(struct iommu_table *tbl);
-
-int iommu_take_ownership(struct powerpc_iommu *iommu)
-{
-	int i, j, rc = 0;
-
-	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = &iommu->tables[i];
-
-		if (!tbl->it_map)
-			continue;
-
-		rc = iommu_table_take_ownership(tbl);
-		if (rc) {
-			for (j = 0; j < i; ++j)
-				iommu_table_release_ownership(
-						&iommu->tables[j]);
-
-			return rc;
-		}
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_take_ownership);
-
-static void iommu_table_release_ownership(struct iommu_table *tbl)
-{
-	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-
-	spin_lock_irqsave(&tbl->large_pool.lock, flags);
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_lock(&tbl->pools[i].lock);
-
-	memset(tbl->it_map, 0, sz);
-
-	/* Restore bit#0 set by iommu_init_table() */
-	if (tbl->it_offset == 0)
-		set_bit(0, tbl->it_map);
-
-	for (i = 0; i < tbl->nr_pools; i++)
-		spin_unlock(&tbl->pools[i].lock);
-	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
-}
-
-extern void iommu_release_ownership(struct powerpc_iommu *iommu)
-{
-	int i;
-
-	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = &iommu->tables[i];
-
-		if (tbl->it_map)
-			iommu_table_release_ownership(tbl);
-	}
-}
-EXPORT_SYMBOL_GPL(iommu_release_ownership);
-
 int iommu_add_device(struct device *dev)
 {
 	struct iommu_table *tbl;
-- 
2.0.0




* [PATCH v3 23/24] vfio/spapr: Enable multiple groups in a container
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 243 +++++++++++++++++++++++-------------
 1 file changed, 155 insertions(+), 88 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0987ae..8bcafb7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -84,9 +84,15 @@ static void decrement_locked_vm(long npages)
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_group *grp;
 	bool enabled;
 	struct list_head mem_list;
+	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
+	struct list_head group_list;
+};
+
+struct tce_iommu_group {
+	struct list_head next;
+	struct iommu_group *grp;
 };
 
 struct tce_memory {
@@ -265,17 +271,21 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
 	return false;
 }
 
+static inline bool tce_groups_attached(struct tce_container *container)
+{
+	return !list_empty(&container->group_list);
+}
+
 static struct iommu_table *spapr_tce_find_table(
 		struct tce_container *container,
 		phys_addr_t ioba)
 {
 	long i;
 	struct iommu_table *ret = NULL;
-	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
 
 	mutex_lock(&container->lock);
 	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = &iommu->tables[i];
+		struct iommu_table *tbl = &container->tables[i];
 		unsigned long entry = ioba >> tbl->it_page_shift;
 		unsigned long start = tbl->it_offset;
 		unsigned long end = start + tbl->it_size;
@@ -290,13 +300,31 @@ static struct iommu_table *spapr_tce_find_table(
 	return ret;
 }
 
+static unsigned long tce_default_winsize(struct tce_container *container)
+{
+	struct tce_iommu_group *tcegrp;
+	struct powerpc_iommu *iommu;
+
+	if (!tce_groups_attached(container))
+		return 0;
+
+	tcegrp = list_first_entry(&container->group_list,
+			struct tce_iommu_group, next);
+	if (!tcegrp)
+		return 0;
+
* [PATCH v3 23/24] vfio/spapr: Enable multiple groups in a container
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 243 +++++++++++++++++++++++-------------
 1 file changed, 155 insertions(+), 88 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0987ae..8bcafb7 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -84,9 +84,15 @@ static void decrement_locked_vm(long npages)
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_group *grp;
 	bool enabled;
 	struct list_head mem_list;
+	struct iommu_table tables[POWERPC_IOMMU_MAX_TABLES];
+	struct list_head group_list;
+};
+
+struct tce_iommu_group {
+	struct list_head next;
+	struct iommu_group *grp;
 };
 
 struct tce_memory {
@@ -265,17 +271,21 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
 	return false;
 }
 
+static inline bool tce_groups_attached(struct tce_container *container)
+{
+	return !list_empty(&container->group_list);
+}
+
 static struct iommu_table *spapr_tce_find_table(
 		struct tce_container *container,
 		phys_addr_t ioba)
 {
 	long i;
 	struct iommu_table *ret = NULL;
-	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
 
 	mutex_lock(&container->lock);
 	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
-		struct iommu_table *tbl = &iommu->tables[i];
+		struct iommu_table *tbl = &container->tables[i];
 		unsigned long entry = ioba >> tbl->it_page_shift;
 		unsigned long start = tbl->it_offset;
 		unsigned long end = start + tbl->it_size;
@@ -290,13 +300,31 @@ static struct iommu_table *spapr_tce_find_table(
 	return ret;
 }
 
+static unsigned long tce_default_winsize(struct tce_container *container)
+{
+	struct tce_iommu_group *tcegrp;
+	struct powerpc_iommu *iommu;
+
+	if (!tce_groups_attached(container))
+		return 0;
+
+	tcegrp = list_first_entry(&container->group_list,
+			struct tce_iommu_group, next);
+	if (!tcegrp)
+		return 0;
+
+	iommu = iommu_group_get_iommudata(tcegrp->grp);
+	if (!iommu)
+		return 0;
+
+	return iommu->tce32_size;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	struct powerpc_iommu *iommu;
-	struct iommu_table *tbl;
 
-	if (!container->grp)
+	if (!tce_groups_attached(container))
 		return -ENXIO;
 
 	if (container->enabled)
@@ -328,12 +356,8 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * KVM agnostic.
 	 */
 	if (!tce_preregistered(container)) {
-		iommu = iommu_group_get_iommudata(container->grp);
-		if (!iommu)
-			return -EFAULT;
-
-		tbl = &iommu->tables[0];
-		ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
+		ret = try_increment_locked_vm(
+				tce_default_winsize(container) >> PAGE_SHIFT);
 		if (ret)
 			return ret;
 	}
@@ -343,27 +367,23 @@ static int tce_iommu_enable(struct tce_container *container)
 	return ret;
 }
 
+static int tce_iommu_clear(struct tce_container *container,
+		struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages);
+
 static void tce_iommu_disable(struct tce_container *container)
 {
-	struct powerpc_iommu *iommu;
-	struct iommu_table *tbl;
-
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!current->mm)
 		return;
 
-	if (!tce_preregistered(container)) {
-		iommu = iommu_group_get_iommudata(container->grp);
-		if (!iommu)
-			return;
-
-		tbl = &iommu->tables[0];
-		decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
-	}
+	if (!tce_preregistered(container))
+		decrement_locked_vm(
+				tce_default_winsize(container) >> PAGE_SHIFT);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -381,20 +401,44 @@ static void *tce_iommu_open(unsigned long arg)
 
 	mutex_init(&container->lock);
 	INIT_LIST_HEAD_RCU(&container->mem_list);
+	INIT_LIST_HEAD_RCU(&container->group_list);
 
 	return container;
 }
 
 static void tce_iommu_release(void *iommu_data)
 {
+	int i;
+	struct powerpc_iommu *iommu;
+	struct tce_iommu_group *tcegrp;
 	struct tce_container *container = iommu_data;
 	struct tce_memory *mem, *memtmp;
+	struct powerpc_iommu_ops *iommuops = NULL;
 
-	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->grp)
-		tce_iommu_detach_group(iommu_data, container->grp);
+	while (tce_groups_attached(container)) {
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
+		iommuops = iommu->ops;
+		tce_iommu_detach_group(iommu_data, tcegrp->grp);
+	}
+
+	/* Free tables */
+	if (iommuops) {
+		for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+			struct iommu_table *tbl = &container->tables[i];
+
+			tce_iommu_clear(container, tbl,
+					tbl->it_offset, tbl->it_size);
+
+			if (!tce_preregistered(container))
+				decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
+
+			iommuops->free_table(tbl);
+		}
+	}
 
 	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next)
 		tce_do_unregister_pages(container, mem);
@@ -568,16 +612,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl;
+		struct tce_iommu_group *tcegrp;
 		struct powerpc_iommu *iommu;
 
-		if (WARN_ON(!container->grp))
+		if (!tce_groups_attached(container))
 			return -ENXIO;
 
-		iommu = iommu_group_get_iommudata(container->grp);
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
 
-		tbl = &iommu->tables[0];
-		if (WARN_ON_ONCE(!tbl))
+		if (!iommu)
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
@@ -589,9 +634,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
-		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
-		info.flags = 0;
+		info.dma32_window_start = iommu->tce32_start;
+		info.dma32_window_size = iommu->tce32_size;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
@@ -603,9 +647,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		struct iommu_table *tbl;
 		unsigned long tce;
 
-		if (WARN_ON(!container->grp ||
-				!iommu_group_get_iommudata(container->grp)))
-			return -ENXIO;
+		if (!container->enabled)
+			return -EPERM;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -655,10 +698,6 @@ static long tce_iommu_ioctl(void *iommu_data,
 		struct vfio_iommu_type1_dma_unmap param;
 		struct iommu_table *tbl;
 
-		if (WARN_ON(!container->grp ||
-				!iommu_group_get_iommudata(container->grp)))
-			return -ENXIO;
-
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
 				size);
 
@@ -747,12 +786,20 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
-	case VFIO_EEH_PE_OP:
-		if (!container->grp)
-			return -ENODEV;
 
-		return vfio_spapr_iommu_eeh_ioctl(container->grp,
-						  cmd, arg);
+	case VFIO_EEH_PE_OP: {
+		struct tce_iommu_group *tcegrp;
+
+		ret = 0;
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			ret = vfio_spapr_iommu_eeh_ioctl(tcegrp->grp,
+					cmd, arg);
+			if (ret)
+				return ret;
+		}
+		return ret;
+	}
+
 	}
 
 	return -ENOTTY;
@@ -761,40 +808,63 @@ static long tce_iommu_ioctl(void *iommu_data,
 static int tce_iommu_attach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
-	int ret = 0;
+	int ret = 0, i;
 	struct tce_container *container = iommu_data;
-	struct powerpc_iommu *iommu;
+	struct powerpc_iommu *iommu = iommu_group_get_iommudata(iommu_group);
+	struct tce_iommu_group *tcegrp;
 
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->grp) {
-		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->grp),
-				iommu_group_id(iommu_group));
-		ret = -EBUSY;
-	} else if (container->enabled) {
-		pr_err("tce_vfio: attaching group #%u to enabled container\n",
-				iommu_group_id(iommu_group));
-		ret = -EBUSY;
+
+	list_for_each_entry(tcegrp, &container->group_list, next) {
+		struct powerpc_iommu *iommutmp;
+
+		if (tcegrp->grp == iommu_group) {
+			pr_warn("tce_vfio: Group %d is already attached\n",
+					iommu_group_id(iommu_group));
+			ret = -EBUSY;
+			goto unlock_exit;
+		}
+		iommutmp = iommu_group_get_iommudata(tcegrp->grp);
+		if (iommutmp->ops != iommu->ops) {
+			pr_warn("tce_vfio: Group %d is incompatible with group %d\n",
+					iommu_group_id(iommu_group),
+					iommu_group_id(tcegrp->grp));
+			ret = -EBUSY;
+			goto unlock_exit;
+		}
+	}
+
+	/*
+	 * Disable iommu bypass, otherwise the user can DMA to all of
+	 * our physical memory via the bypass window instead of just
+	 * the pages that have been explicitly mapped into the iommu
+	 */
+	if (iommu->ops && iommu->ops->set_ownership) {
+		iommu->ops->set_ownership(iommu, true);
 	} else {
-		iommu = iommu_group_get_iommudata(iommu_group);
-		if (WARN_ON_ONCE(!iommu))
-			return -ENXIO;
-		/*
-		 * Disable iommu bypass, otherwise the user can DMA to all of
-		 * our physical memory via the bypass window instead of just
-		 * the pages that has been explicitly mapped into the iommu
-		 */
-		if (iommu->ops && iommu->ops->set_ownership) {
-			iommu->ops->set_ownership(iommu, true);
-			container->grp = iommu_group;
-		} else {
-			return -ENODEV;
-		}
+		ret = -ENODEV;
+		goto unlock_exit;
 	}
 
+	tcegrp = kzalloc(sizeof(*tcegrp), GFP_KERNEL);
+	tcegrp->grp = iommu_group;
+	list_add(&tcegrp->next, &container->group_list);
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &container->tables[i];
+
+		if (!tbl->it_size)
+			continue;
+
+		/* Set the default window to a new group */
+		ret = iommu->ops->set_window(iommu, i, tbl);
+		if (ret)
+			goto unlock_exit;
+	}
+
+unlock_exit:
 	mutex_unlock(&container->lock);
 
 	return ret;
@@ -805,33 +875,30 @@ static void tce_iommu_detach_group(void *iommu_data,
 {
 	struct tce_container *container = iommu_data;
 	struct powerpc_iommu *iommu;
+	struct tce_iommu_group *tcegrp, *tcegrptmp;
+	long i;
 
 	mutex_lock(&container->lock);
-	if (iommu_group != container->grp) {
-		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
-				iommu_group_id(iommu_group),
-				iommu_group_id(container->grp));
-	} else {
-		if (container->enabled) {
-			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(container->grp));
-			tce_iommu_disable(container);
-		}
 
-		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
-				iommu_group_id(iommu_group), iommu_group); */
-		container->grp = NULL;
+	/* Detach windows from IOMMUs */
+	list_for_each_entry_safe(tcegrp, tcegrptmp, &container->group_list,
+			next) {
+		if (tcegrp->grp != iommu_group)
+			continue;
 
+		list_del(&tcegrp->next);
 		iommu = iommu_group_get_iommudata(iommu_group);
 		BUG_ON(!iommu);
 
-		tce_iommu_clear(container, &iommu->tables[0],
-				iommu->tables[0].it_offset,
-				iommu->tables[0].it_size);
+		for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i)
+			iommu->ops->unset_window(iommu, i);
 
 		/* Kernel owns the device now, we can restore bypass */
 		if (iommu->ops && iommu->ops->set_ownership)
 			iommu->ops->set_ownership(iommu, false);
+
+		kfree(tcegrp);
+		break;
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 24/24] vfio: powerpc/spapr: Support Dynamic DMA windows
  2015-01-29  9:21 ` Alexey Kardashevskiy
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alex Williamson, Alexander Graf,
	Alexander Gordeev, linux-kernel

This adds ioctls to create and remove DMA windows.

This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as the number of supported windows and the maximum
number of TCE table levels.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 137 +++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h           |  24 ++++++-
 3 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 33009f9..7ca1c8c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -133,7 +133,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
 
-#define POWERPC_IOMMU_MAX_TABLES	1
+#define POWERPC_IOMMU_MAX_TABLES	2
 
 #define POWERPC_IOMMU_DEFAULT_LEVELS	1
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8bcafb7..d3a1cc9 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -300,6 +300,20 @@ static struct iommu_table *spapr_tce_find_table(
 	return ret;
 }
 
+static int spapr_tce_find_free_table(struct tce_container *container)
+{
+	int i;
+
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &container->tables[i];
+
+		if (!tbl->it_size)
+			return i;
+	}
+
+	return -1;
+}
+
 static unsigned long tce_default_winsize(struct tce_container *container)
 {
 	struct tce_iommu_group *tcegrp;
@@ -594,7 +608,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
 	struct tce_container *container = iommu_data;
-	unsigned long minsz;
+	unsigned long minsz, ddwsz;
 	long ret;
 
 	switch (cmd) {
@@ -636,6 +650,15 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 		info.dma32_window_start = iommu->tce32_start;
 		info.dma32_window_size = iommu->tce32_size;
+		info.windows_supported = iommu->windows_supported;
+		info.levels = iommu->levels;
+		info.flags = iommu->flags;
+
+		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				levels);
+
+		if (info.argsz == ddwsz)
+			minsz = ddwsz;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
@@ -800,6 +823,118 @@ static long tce_iommu_ioctl(void *iommu_data,
 		return ret;
 	}
 
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct powerpc_iommu *iommu;
+		struct tce_iommu_group *tcegrp;
+		int num;
+
+		if (!tce_preregistered(container))
+			return -ENXIO;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (create.flags)
+			return -EINVAL;
+
+		num = spapr_tce_find_free_table(container);
+		if (num < 0)
+			return -ENOSYS;
+
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
+
+		ret = iommu->ops->create_table(iommu, num,
+				create.page_shift, create.window_shift,
+				create.levels,
+				&container->tables[num]);
+		if (ret)
+			return ret;
+
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			struct powerpc_iommu *iommutmp =
+					iommu_group_get_iommudata(tcegrp->grp);
+
+			if (WARN_ON_ONCE(iommutmp->ops != iommu->ops))
+				return -EFAULT;
+
+			ret = iommu->ops->set_window(iommutmp, num,
+					&container->tables[num]);
+			if (ret)
+				return ret;
+		}
+
+		create.start_addr =
+				container->tables[num].it_offset <<
+				container->tables[num].it_page_shift;
+
+		if (copy_to_user((void __user *)arg, &create, minsz))
+			return -EFAULT;
+
+		mutex_lock(&container->lock);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+		struct vfio_iommu_spapr_tce_remove remove;
+		struct powerpc_iommu *iommu = NULL;
+		struct iommu_table *tbl;
+		struct tce_iommu_group *tcegrp;
+		int num;
+
+		if (!tce_preregistered(container))
+			return -ENXIO;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+				start_addr);
+
+		if (copy_from_user(&remove, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (remove.argsz < minsz)
+			return -EINVAL;
+
+		if (remove.flags)
+			return -EINVAL;
+
+
+		tbl = spapr_tce_find_table(container, remove.start_addr);
+		if (!tbl)
+			return -EINVAL;
+
+		/* Detach windows from IOMMUs */
+		mutex_lock(&container->lock);
+
+		/* Detach groups from IOMMUs */
+		num = tbl - container->tables;
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			iommu = iommu_group_get_iommudata(tcegrp->grp);
+			if (container->tables[num].it_size)
+				iommu->ops->unset_window(iommu, num);
+		}
+
+		/* Free table */
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
+
+		tce_iommu_clear(container, tbl,
+				tbl->it_offset, tbl->it_size);
+		iommu->ops->free_table(tbl);
+
+		mutex_unlock(&container->lock);
+
+		return 0;
+	}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2bb0c9b..7ed7000 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -483,9 +483,11 @@ struct vfio_iommu_type1_unregister_memory {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	__u32 windows_supported;
+	__u32 levels;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -521,6 +523,26 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	__u32 levels;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v3 24/24] vfio: powerpc/spapr: Support Dynamic DMA windows
@ 2015-01-29  9:22   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-01-29  9:22 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Gavin Shan, Alexander Graf,
	Alex Williamson, Alexander Gordeev, Paul Mackerras, linux-kernel

This adds create/remove window ioctls to create and remove DMA windows.

This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 137 +++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h           |  24 ++++++-
 3 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 33009f9..7ca1c8c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -133,7 +133,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
 
-#define POWERPC_IOMMU_MAX_TABLES	1
+#define POWERPC_IOMMU_MAX_TABLES	2
 
 #define POWERPC_IOMMU_DEFAULT_LEVELS	1
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8bcafb7..d3a1cc9 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -300,6 +300,20 @@ static struct iommu_table *spapr_tce_find_table(
 	return ret;
 }
 
+static int spapr_tce_find_free_table(struct tce_container *container)
+{
+	int i;
+
+	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
+		struct iommu_table *tbl = &container->tables[i];
+
+		if (!tbl->it_size)
+			return i;
+	}
+
+	return -1;
+}
+
 static unsigned long tce_default_winsize(struct tce_container *container)
 {
 	struct tce_iommu_group *tcegrp;
@@ -594,7 +608,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 				 unsigned int cmd, unsigned long arg)
 {
 	struct tce_container *container = iommu_data;
-	unsigned long minsz;
+	unsigned long minsz, ddwsz;
 	long ret;
 
 	switch (cmd) {
@@ -636,6 +650,15 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 		info.dma32_window_start = iommu->tce32_start;
 		info.dma32_window_size = iommu->tce32_size;
+		info.windows_supported = iommu->windows_supported;
+		info.levels = iommu->levels;
+		info.flags = iommu->flags;
+
+		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				levels);
+
+		if (info.argsz == ddwsz)
+			minsz = ddwsz;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
@@ -800,6 +823,118 @@ static long tce_iommu_ioctl(void *iommu_data,
 		return ret;
 	}
 
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct powerpc_iommu *iommu;
+		struct tce_iommu_group *tcegrp;
+		int num;
+
+		if (!tce_preregistered(container))
+			return -ENXIO;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (create.flags)
+			return -EINVAL;
+
+		num = spapr_tce_find_free_table(container);
+		if (num < 0)
+			return -ENOSYS;
+
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
+
+		ret = iommu->ops->create_table(iommu, num,
+				create.page_shift, create.window_shift,
+				create.levels,
+				&container->tables[num]);
+		if (ret)
+			return ret;
+
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			struct powerpc_iommu *iommutmp =
+					iommu_group_get_iommudata(tcegrp->grp);
+
+			if (WARN_ON_ONCE(iommutmp->ops != iommu->ops))
+				return -EFAULT;
+
+			ret = iommu->ops->set_window(iommutmp, num,
+					&container->tables[num]);
+			if (ret)
+				return ret;
+		}
+
+		create.start_addr =
+				container->tables[num].it_offset <<
+				container->tables[num].it_page_shift;
+
+		if (copy_to_user((void __user *)arg, &create, minsz))
+			return -EFAULT;
+
+		mutex_lock(&container->lock);
+		mutex_unlock(&container->lock);
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+		struct vfio_iommu_spapr_tce_remove remove;
+		struct powerpc_iommu *iommu = NULL;
+		struct iommu_table *tbl;
+		struct tce_iommu_group *tcegrp;
+		int num;
+
+		if (!tce_preregistered(container))
+			return -ENXIO;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+				start_addr);
+
+		if (copy_from_user(&remove, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (remove.argsz < minsz)
+			return -EINVAL;
+
+		if (remove.flags)
+			return -EINVAL;
+
+
+		tbl = spapr_tce_find_table(container, remove.start_addr);
+		if (!tbl)
+			return -EINVAL;
+
+		/* Detach windows from IOMMUs */
+		mutex_lock(&container->lock);
+
+		/* Detach groups from IOMMUs */
+		num = tbl - container->tables;
+		list_for_each_entry(tcegrp, &container->group_list, next) {
+			iommu = iommu_group_get_iommudata(tcegrp->grp);
+			if (container->tables[num].it_size)
+				iommu->ops->unset_window(iommu, num);
+		}
+
+		/* Free table */
+		tcegrp = list_first_entry(&container->group_list,
+				struct tce_iommu_group, next);
+		iommu = iommu_group_get_iommudata(tcegrp->grp);
+
+		tce_iommu_clear(container, tbl,
+				tbl->it_offset, tbl->it_size);
+		iommu->ops->free_table(tbl);
+
+		mutex_unlock(&container->lock);
+
+		return 0;
+	}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2bb0c9b..7ed7000 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -483,9 +483,11 @@ struct vfio_iommu_type1_unregister_memory {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	__u32 windows_supported;
+	__u32 levels;
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
@@ -521,6 +523,26 @@ struct vfio_eeh_pe_op {
 
 #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	__u32 levels;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.0.0
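The new VFIO_IOMMU_SPAPR_TCE_CREATE/REMOVE handlers validate `argsz` against a `minsz` computed with the kernel's `offsetofend()` helper. Below is a hedged, user-space-compilable sketch of that validation arithmetic: the structs are copied from the uapi additions above, `offsetofend` is reimplemented to match the kernel's definition, and the concrete 32/16-byte results assume the usual 8-byte alignment of `__u64` on 64-bit targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t __u32;
typedef uint64_t __u64;

/* Copied from the uapi additions above. */
struct vfio_iommu_spapr_tce_create {
	__u32 argsz;
	__u32 flags;
	/* in */
	__u32 page_shift;
	__u32 window_shift;
	__u32 levels;
	/* out */
	__u64 start_addr;
};

struct vfio_iommu_spapr_tce_remove {
	__u32 argsz;
	__u32 flags;
	/* in */
	__u64 start_addr;
};

/* Same definition as the kernel's offsetofend() helper. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/*
 * minsz covers everything up to and including the last field the ioctl
 * touches; callers passing argsz < minsz are rejected with -EINVAL.
 * CREATE copies start_addr back out, so its minsz spans the whole struct
 * (note the 4 bytes of padding before the 8-byte-aligned start_addr).
 */
static size_t create_minsz(void)
{
	return offsetofend(struct vfio_iommu_spapr_tce_create, start_addr);
}

static size_t remove_minsz(void)
{
	return offsetofend(struct vfio_iommu_spapr_tce_remove, start_addr);
}
```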

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 02/24] vfio: powerpc/iommu: Check that TCE page size is equal to it_page_size
  2015-01-29  9:21   ` Alexey Kardashevskiy
@ 2015-02-02 21:45     ` Alex Williamson
  -1 siblings, 0 replies; 70+ messages in thread
From: Alex Williamson @ 2015-02-02 21:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
> This checks that the TCE table page size is not bigger than the size of
> the page we have just pinned and whose physical address we are going to
> put into the table.
> 
> Otherwise the hardware gets unwanted access to the physical memory between
> the end of the actual page and the end of the aligned-up TCE page.
> 
> Since compound_order() and compound_head() work correctly on non-huge
> pages, there is no need for additional check whether the page is huge.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v5:
> * check is done for all page sizes now, not just for huge pages
> * failed check returns EFAULT now (was EINVAL)
> * moved the check to VFIO SPAPR IOMMU driver
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index dc4a886..99b98fa 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -47,6 +47,22 @@ struct tce_container {
>  	bool enabled;
>  };
>  
> +static bool tce_check_page_size(struct page *page, unsigned page_shift)

What does true/false mean for a "check page size" operation?  Does true
mean good?  Bad?  How about naming it page-is-contained or something
along those lines?

> +{
> +	unsigned shift;
> +
> +	/*
> +	 * Check that the TCE table granularity is not bigger than the size of
> +	 * a page we just found. Otherwise the hardware can get access to
> +	 * a bigger memory chunk than it should.
> +	 */
> +	shift = PAGE_SHIFT + compound_order(compound_head(page));
> +	if (shift >= page_shift)
> +		return true;
> +
> +	return false;
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> @@ -199,6 +215,12 @@ static long tce_iommu_build(struct tce_container *container,
>  			ret = -EFAULT;
>  			break;
>  		}
> +
> +		if (!tce_check_page_size(page, tbl->it_page_shift)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
>  		hva = (unsigned long) page_address(page) +
>  			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>  




^ permalink raw reply	[flat|nested] 70+ messages in thread
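The containment test debated above (including the reviewer's `page_is_contained` naming suggestion) reduces to comparing the pinned page's effective size against the TCE page shift. A minimal user-space sketch, with the caveat that `order` stands in for `compound_order(compound_head(page))` (the real helpers need a `struct page`), and assuming 4K base pages for the example:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* assumption: 4K base pages for this sketch */

/*
 * True when a (possibly compound) page of the given order fully covers
 * one TCE of size 1 << page_shift, i.e. the hardware cannot reach past
 * the end of the pinned memory.  Mirrors tce_check_page_size() above,
 * renamed along the lines the reviewer suggests.
 */
static bool page_is_contained(unsigned order, unsigned page_shift)
{
	unsigned shift = PAGE_SHIFT + order;

	return shift >= page_shift;
}
```

For instance, a 16MB huge page (order 12 with 4K base pages, shift 24) contains a 64K TCE (shift 16), while a plain 4K page (order 0) does not.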

* Re: [PATCH v3 14/24] vfio: powerpc/spapr: Register memory
  2015-01-29  9:21   ` Alexey Kardashevskiy
@ 2015-02-03  0:11     ` Alex Williamson
  -1 siblings, 0 replies; 70+ messages in thread
From: Alex Williamson @ 2015-02-03  0:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
> The existing implementation accounts the whole DMA window in
> the locked_vm counter, which will get even worse with multiple
> containers and huge DMA windows.
> 
> This introduces 2 ioctls to register/unregister DMA memory which
> receive user space address and size of the memory region which
> needs to be pinned/unpinned and counted in locked_vm.
> 
> If any memory region was registered, all subsequent DMA map requests
> should address already pinned memory. If no memory was registered,
> then the amount of memory required for the single default DMA window
> will be accounted when the container is enabled, and every map/unmap
> will pin/unpin a page.
> 
> Dynamic DMA window and in-kernel acceleration will require memory to
> be registered in order to work.
> 
> The accounting is done per VFIO container. When the support of
> multiple groups per container is added, we will have accurate locked_vm
> accounting.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 333 ++++++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h           |  29 ++++
>  2 files changed, 331 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 8256275..d0987ae 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -86,8 +86,169 @@ struct tce_container {
>  	struct mutex lock;
>  	struct iommu_group *grp;
>  	bool enabled;
> +	struct list_head mem_list;
>  };
>  
> +struct tce_memory {
> +	struct list_head next;
> +	struct rcu_head rcu;
> +	__u64 vaddr;
> +	__u64 size;
> +	__u64 pfns[];
> +};

So we're using 2MB of kernel memory per 1G of user mapped memory, right?
Or are we using bigger pages here?  I'm not sure the kmalloc below is
the appropriate allocator for something that can be so large.
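The 2MB-per-1G figure follows from the `pfns[]` array storing one `__u64` per system page; the kzalloc size `size >> (PAGE_SHIFT - 3)` in tce_register_pages() below is exactly `(size >> PAGE_SHIFT) * 8`. A quick check of that arithmetic (4K base pages assumed for the 2MB figure; with 64K pages the overhead drops sharply):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of pfns[] bookkeeping per registered region: one 8-byte pfn
 * entry per system page of the given shift. */
static uint64_t pfns_overhead(uint64_t size, unsigned page_shift)
{
	return (size >> page_shift) * sizeof(uint64_t);
}
```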

> +
> +static void tce_unpin_pages(struct tce_container *container,
> +		struct tce_memory *mem, __u64 vaddr, __u64 size)
> +{
> +	__u64 off;
> +	struct page *page = NULL;
> +
> +
> +	for (off = 0; off < size; off += PAGE_SIZE) {
> +		if (!mem->pfns[off >> PAGE_SHIFT])
> +			continue;
> +
> +		page = pfn_to_page(mem->pfns[off >> PAGE_SHIFT]);
> +		if (!page)
> +			continue;
> +
> +		put_page(page);
> +		mem->pfns[off >> PAGE_SHIFT] = 0;
> +	}

Seems cleaner to count by 1 rather than PAGE_SIZE (ie. shift size once
instead of off 3 times).
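A hypothetical reshape of tce_unpin_pages() along those lines might look like the following: the size is shifted once into a page count and the loop walks indexes directly. `released` counts where the real code would call `put_page(pfn_to_page(pfns[i]))`.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: shift size once, then iterate page indexes instead of
 * stepping a byte offset and shifting it three times per iteration. */
static uint64_t unpin_by_index(uint64_t *pfns, uint64_t size,
		unsigned page_shift)
{
	uint64_t i, released = 0;
	uint64_t npages = size >> page_shift;

	for (i = 0; i < npages; ++i) {
		if (!pfns[i])
			continue;
		/* put_page(pfn_to_page(pfns[i])) would go here */
		++released;
		pfns[i] = 0;
	}

	return released;
}
```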

> +}
> +
> +static void release_tce_memory(struct rcu_head *head)
> +{
> +	struct tce_memory *mem = container_of(head, struct tce_memory, rcu);
> +
> +	kfree(mem);
> +}
> +
> +static void tce_do_unregister_pages(struct tce_container *container,
> +		struct tce_memory *mem)
> +{
> +	tce_unpin_pages(container, mem, mem->vaddr, mem->size);
> +	decrement_locked_vm(mem->size);
> +	list_del_rcu(&mem->next);
> +	call_rcu_sched(&mem->rcu, release_tce_memory);
> +}
> +
> +static long tce_unregister_pages(struct tce_container *container,
> +		__u64 vaddr, __u64 size)
> +{
> +	struct tce_memory *mem, *memtmp;
> +
> +	if (container->enabled)
> +		return -EBUSY;
> +
> +	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> +		return -EINVAL;
> +
> +	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next) {
> +		if ((mem->vaddr == vaddr) && (mem->size == size)) {
> +			tce_do_unregister_pages(container, mem);
> +			return 0;
> +		}
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +static long tce_pin_pages(struct tce_container *container,
> +		struct tce_memory *mem, __u64 vaddr, __u64 size)
> +{
> +	__u64 off;
> +	struct page *page = NULL;
> +
> +	for (off = 0; off < size; off += PAGE_SIZE) {
> +		if (1 != get_user_pages_fast(vaddr + off,
> +					1/* pages */, 1/* iswrite */, &page)) {
> +			tce_unpin_pages(container, mem, vaddr, off);
> +			return -EFAULT;
> +		}
> +
> +		mem->pfns[off >> PAGE_SHIFT] = page_to_pfn(page);
> +	}
> +
> +	return 0;
> +}
> +
> +static long tce_register_pages(struct tce_container *container,
> +		__u64 vaddr, __u64 size)
> +{
> +	long ret;
> +	struct tce_memory *mem;
> +
> +	if (container->enabled)
> +		return -EBUSY;
> +
> +	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> +			((vaddr + size) < vaddr))
> +		return -EINVAL;
> +
> +	/* Any overlap with registered chunks? */
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
> +		if ((mem->vaddr < (vaddr + size)) &&
> +				(vaddr < (mem->vaddr + mem->size))) {
> +			ret = -EBUSY;
> +			goto unlock_exit;
> +		}
> +	}
> +
> +	ret = try_increment_locked_vm(size >> PAGE_SHIFT);
> +	if (ret)
> +		goto unlock_exit;
> +
> +	mem = kzalloc(sizeof(*mem) + (size >> (PAGE_SHIFT - 3)), GFP_KERNEL);


I suspect that userspace can break kmalloc with the potential size of
this structure.  You might need a vmalloc.  I also wonder if there isn't
a more efficient tree structure to use.

> +	if (!mem)
> +		goto unlock_exit;
> +
> +	if (tce_pin_pages(container, mem, vaddr, size))
> +		goto free_exit;
> +
> +	mem->vaddr = vaddr;
> +	mem->size = size;
> +
> +	list_add_rcu(&mem->next, &container->mem_list);
> +	rcu_read_unlock();
> +
> +	return 0;
> +
> +free_exit:
> +	kfree(mem);
> +
> +unlock_exit:
> +	decrement_locked_vm(size >> PAGE_SHIFT);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +static inline bool tce_preregistered(struct tce_container *container)
> +{
> +	return !list_empty(&container->mem_list);
> +}
> +
> +static bool tce_pinned(struct tce_container *container,
> +		__u64 vaddr, __u64 size)
> +{
> +	struct tce_memory *mem;
> +	bool ret = false;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
> +		if ((mem->vaddr <= vaddr) &&
> +				(vaddr + size <= mem->vaddr + mem->size)) {
> +			ret = true;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  static bool tce_check_page_size(struct page *page, unsigned page_shift)
>  {
>  	unsigned shift;
> @@ -166,14 +327,16 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 * as this information is only available from KVM and VFIO is
>  	 * KVM agnostic.
>  	 */
> -	iommu = iommu_group_get_iommudata(container->grp);
> -	if (!iommu)
> -		return -EFAULT;
> +	if (!tce_preregistered(container)) {
> +		iommu = iommu_group_get_iommudata(container->grp);
> +		if (!iommu)
> +			return -EFAULT;
>  
> -	tbl = &iommu->tables[0];
> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
> -	if (ret)
> -		return ret;
> +		tbl = &iommu->tables[0];
> +		ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
> +		if (ret)
> +			return ret;
> +	}
>  
>  	container->enabled = true;
>  
> @@ -193,12 +356,14 @@ static void tce_iommu_disable(struct tce_container *container)
>  	if (!container->grp || !current->mm)
>  		return;
>  
> -	iommu = iommu_group_get_iommudata(container->grp);
> -	if (!iommu)
> -		return;
> +	if (!tce_preregistered(container)) {
> +		iommu = iommu_group_get_iommudata(container->grp);
> +		if (!iommu)
> +			return;
>  
> -	tbl = &iommu->tables[0];
> -	decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
> +		tbl = &iommu->tables[0];
> +		decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
> +	}
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)
> @@ -215,6 +380,7 @@ static void *tce_iommu_open(unsigned long arg)
>  		return ERR_PTR(-ENOMEM);
>  
>  	mutex_init(&container->lock);
> +	INIT_LIST_HEAD_RCU(&container->mem_list);
>  
>  	return container;
>  }
> @@ -222,6 +388,7 @@ static void *tce_iommu_open(unsigned long arg)
>  static void tce_iommu_release(void *iommu_data)
>  {
>  	struct tce_container *container = iommu_data;
> +	struct tce_memory *mem, *memtmp;
>  
>  	WARN_ON(container->grp);
>  	tce_iommu_disable(container);
> @@ -229,14 +396,19 @@ static void tce_iommu_release(void *iommu_data)
>  	if (container->grp)
>  		tce_iommu_detach_group(iommu_data, container->grp);
>  
> +	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next)
> +		tce_do_unregister_pages(container, mem);
> +
>  	mutex_destroy(&container->lock);
>  
>  	kfree(container);
>  }
>  
> -static void tce_iommu_unuse_page(unsigned long oldtce)
> +static void tce_iommu_unuse_page(struct tce_container *container,
> +		unsigned long oldtce)
>  {
>  	struct page *page;
> +	bool do_put = !tce_preregistered(container);
>  
>  	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
>  		return;
> @@ -245,7 +417,8 @@ static void tce_iommu_unuse_page(unsigned long oldtce)
>  	if (oldtce & TCE_PCI_WRITE)
>  		SetPageDirty(page);
>  
> -	put_page(page);
> +	if (do_put)
> +		put_page(page);
>  }
>  
>  static int tce_iommu_clear(struct tce_container *container,
> @@ -261,7 +434,7 @@ static int tce_iommu_clear(struct tce_container *container,
>  		if (ret)
>  			continue;
>  
> -		tce_iommu_unuse_page(oldtce);
> +		tce_iommu_unuse_page(container, oldtce);
>  	}
>  
>  	return 0;
> @@ -279,42 +452,91 @@ static enum dma_data_direction tce_iommu_direction(unsigned long tce)
>  		return DMA_NONE;
>  }
>  
> +static unsigned long tce_get_hva_cached(struct tce_container *container,
> +		unsigned page_shift, unsigned long tce)
> +{
> +	struct tce_memory *mem;
> +	struct page *page = NULL;
> +	unsigned long hva = -1;
> +
> +	tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
> +		if ((mem->vaddr <= tce) && (tce < (mem->vaddr + mem->size))) {
> +			unsigned long gfn = (tce - mem->vaddr) >> PAGE_SHIFT;
> +			unsigned long hpa = mem->pfns[gfn] << PAGE_SHIFT;
> +
> +			page = pfn_to_page(mem->pfns[gfn]);
> +
> +			if (!tce_check_page_size(page, page_shift))
> +				break;
> +
> +			hva = (unsigned long) __va(hpa);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return hva;
> +}
> +
> +static unsigned long tce_get_hva(struct tce_container *container,
> +		unsigned page_shift, unsigned long tce)
> +{
> +	long ret = 0;
> +	struct page *page = NULL;
> +	unsigned long hva = -1;
> +	enum dma_data_direction direction = tce_iommu_direction(tce);
> +
> +	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> +			direction != DMA_TO_DEVICE, &page);
> +	if (unlikely(ret != 1))
> +		return -1;
> +
> +	if (!tce_check_page_size(page, page_shift)) {
> +		put_page(page);
> +		return -1;
> +	}
> +
> +	hva = (unsigned long) page_address(page) +
> +		(tce & ~((1ULL << page_shift) - 1) & ~PAGE_MASK);
> +
> +	return hva;
> +}
> +
>  static long tce_iommu_build(struct tce_container *container,
>  		struct iommu_table *tbl,
>  		unsigned long entry, unsigned long tce, unsigned long pages)
>  {
>  	long i, ret = 0;
> -	struct page *page = NULL;
>  	unsigned long hva, oldtce;
>  	enum dma_data_direction direction = tce_iommu_direction(tce);
> +	bool do_put = false;
>  
>  	for (i = 0; i < pages; ++i) {
> -		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> -				direction != DMA_TO_DEVICE, &page);
> -		if (unlikely(ret != 1)) {
> -			ret = -EFAULT;
> -			break;
> +		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
> +		if (hva == -1) {
> +			do_put = true;
> +			WARN_ON_ONCE(1);
> +			hva = tce_get_hva(container, tbl->it_page_shift, tce);
>  		}
>  
> -		if (!tce_check_page_size(page, tbl->it_page_shift)) {
> -			ret = -EFAULT;
> -			break;
> -		}
> -
> -		hva = (unsigned long) page_address(page) +
> -			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>  		oldtce = 0;
> -
>  		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
>  		if (ret) {
> -			put_page(page);
> +			if (do_put)
> +				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>  					__func__, entry << tbl->it_page_shift,
>  					tce, ret);
>  			break;
>  		}
>  
> -		tce_iommu_unuse_page(oldtce);
> +		if (do_put)
> +			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
> +
> +		tce_iommu_unuse_page(container, oldtce);
> +
>  		tce += IOMMU_PAGE_SIZE(tbl);
>  	}
>  
> @@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		if (ret)
>  			return ret;
>  
> +		/* If any memory is pinned, only allow pages from that region */
> +		if (tce_preregistered(container) &&
> +				!tce_pinned(container, param.vaddr, param.size))
> +			return -EPERM;
> +
>  		ret = tce_iommu_build(container, tbl,
>  				param.iova >> tbl->it_page_shift,
>  				tce, param.size >> tbl->it_page_shift);
> @@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
>  
>  		return ret;
>  	}
> +	case VFIO_IOMMU_REGISTER_MEMORY: {
> +		struct vfio_iommu_type1_register_memory param;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
> +				size);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* No flag is supported now */
> +		if (param.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&container->lock);
> +		ret = tce_register_pages(container, param.vaddr, param.size);
> +		mutex_unlock(&container->lock);
> +
> +		return ret;
> +	}
> +	case VFIO_IOMMU_UNREGISTER_MEMORY: {
> +		struct vfio_iommu_type1_unregister_memory param;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
> +				size);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* No flag is supported now */
> +		if (param.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&container->lock);
> +		tce_unregister_pages(container, param.vaddr, param.size);
> +		mutex_unlock(&container->lock);
> +
> +		return 0;
> +	}
>  	case VFIO_IOMMU_ENABLE:
>  		mutex_lock(&container->lock);
>  		ret = tce_iommu_enable(container);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 29715d2..2bb0c9b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
> + *
> + * Registers user space memory where DMA is allowed. It pins
> + * user pages and does the locked memory accounting so
> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
> + * get simpler.
> + */
> +struct vfio_iommu_type1_register_memory {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u64	vaddr;				/* Process virtual address */
> +	__u64	size;				/* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
> +
> +/**
> + * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
> + *
> + * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
> + */
> +struct vfio_iommu_type1_unregister_memory {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u64	vaddr;				/* Process virtual address */
> +	__u64	size;				/* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +

Is the user allowed to unregister arbitrary sub-regions of previously
registered memory?  (I think I know the answer, but it should be
documented)

Why are these "type1" structures, shouldn't they be down below?

Do we need an extension or flag bit to describe these as present or is
it sufficient to call and fail?

Do we need two ioctls or one?

What about Documentation/vfio.txt?

>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*




^ permalink raw reply	[flat|nested] 70+ messages in thread
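The interval logic at the heart of the patch above — the "any overlap with registered chunks?" test in tce_register_pages() and the containment test in tce_pinned() — can be isolated as two half-open-interval predicates over `[vaddr, vaddr + size)`. A minimal sketch of just that arithmetic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two half-open regions overlap iff each one starts before the other
 * ends.  Matches the check in tce_register_pages() above. */
static bool regions_overlap(uint64_t a_start, uint64_t a_size,
		uint64_t b_start, uint64_t b_size)
{
	return (a_start < b_start + b_size) && (b_start < a_start + a_size);
}

/* The inner region is fully contained in the outer one.  Matches the
 * check in tce_pinned() above. */
static bool region_contains(uint64_t outer_start, uint64_t outer_size,
		uint64_t inner_start, uint64_t inner_size)
{
	return (outer_start <= inner_start) &&
		(inner_start + inner_size <= outer_start + outer_size);
}
```

Note that adjacent regions (one ending exactly where the other starts) do not overlap, which is why registering back-to-back chunks succeeds while re-registering any part of an existing chunk fails with -EBUSY.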

>  	for (i = 0; i < pages; ++i) {
> -		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> -				direction != DMA_TO_DEVICE, &page);
> -		if (unlikely(ret != 1)) {
> -			ret = -EFAULT;
> -			break;
> +		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
> +		if (hva == -1) {
> +			do_put = true;
> +			WARN_ON_ONCE(1);
> +			hva = tce_get_hva(container, tbl->it_page_shift, tce);
>  		}
>  
> -		if (!tce_check_page_size(page, tbl->it_page_shift)) {
> -			ret = -EFAULT;
> -			break;
> -		}
> -
> -		hva = (unsigned long) page_address(page) +
> -			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>  		oldtce = 0;
> -
>  		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
>  		if (ret) {
> -			put_page(page);
> +			if (do_put)
> +				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>  					__func__, entry << tbl->it_page_shift,
>  					tce, ret);
>  			break;
>  		}
>  
> -		tce_iommu_unuse_page(oldtce);
> +		if (do_put)
> +			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
> +
> +		tce_iommu_unuse_page(container, oldtce);
> +
>  		tce += IOMMU_PAGE_SIZE(tbl);
>  	}
>  
> @@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		if (ret)
>  			return ret;
>  
> +		/* If any memory is pinned, only allow pages from that region */
> +		if (tce_preregistered(container) &&
> +				!tce_pinned(container, param.vaddr, param.size))
> +			return -EPERM;
> +
>  		ret = tce_iommu_build(container, tbl,
>  				param.iova >> tbl->it_page_shift,
>  				tce, param.size >> tbl->it_page_shift);
> @@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
>  
>  		return ret;
>  	}
> +	case VFIO_IOMMU_REGISTER_MEMORY: {
> +		struct vfio_iommu_type1_register_memory param;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
> +				size);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* No flag is supported now */
> +		if (param.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&container->lock);
> +		ret = tce_register_pages(container, param.vaddr, param.size);
> +		mutex_unlock(&container->lock);
> +
> +		return ret;
> +	}
> +	case VFIO_IOMMU_UNREGISTER_MEMORY: {
> +		struct vfio_iommu_type1_unregister_memory param;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
> +				size);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* No flag is supported now */
> +		if (param.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&container->lock);
> +		tce_unregister_pages(container, param.vaddr, param.size);
> +		mutex_unlock(&container->lock);
> +
> +		return 0;
> +	}
>  	case VFIO_IOMMU_ENABLE:
>  		mutex_lock(&container->lock);
>  		ret = tce_iommu_enable(container);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 29715d2..2bb0c9b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/**
> + * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
> + *
> + * Registers user space memory where DMA is allowed. It pins
> + * user pages and does the locked memory accounting so
> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
> + * get simpler.
> + */
> +struct vfio_iommu_type1_register_memory {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u64	vaddr;				/* Process virtual address */
> +	__u64	size;				/* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
> +
> +/**
> + * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
> + *
> + * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
> + */
> +struct vfio_iommu_type1_unregister_memory {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u64	vaddr;				/* Process virtual address */
> +	__u64	size;				/* Size of mapping (bytes) */
> +};
> +#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +

Is the user allowed to unregister arbitrary sub-regions of previously
registered memory?  (I think I know the answer, but it should be
documented)

Why are these "type1" structures, shouldn't they be down below?

Do we need an extension or flag bit to describe these as present or is
it sufficient to call and fail?

Do we need two ioctls or one?

What about Documentation/vfio.txt?

>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 05/24] vfio: powerpc/spapr: Move locked_vm accounting to helpers
  2015-01-29  9:21   ` Alexey Kardashevskiy
@ 2015-02-03  0:12     ` Alex Williamson
  -1 siblings, 0 replies; 70+ messages in thread
From: Alex Williamson @ 2015-02-03  0:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
> This moves locked page accounting into helpers.
> Later they will be reused for Dynamic DMA windows (DDW).
> 
> While we are here, update the comment explaining why RLIMIT_MEMLOCK
> might be required to be bigger than the guest RAM. This also prints
> pid of the current process in pr_warn/pr_debug.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 72 +++++++++++++++++++++++++++----------
>  1 file changed, 53 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c596053..29d5708 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -29,6 +29,47 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group);
>  
> +#define IOMMU_TABLE_PAGES(tbl) \
> +		(((tbl)->it_size << (tbl)->it_page_shift) >> PAGE_SHIFT)

A bit of an infringement on the global namespace with such a generic
name.

> +
> +static long try_increment_locked_vm(long npages)
> +{
> +	long ret = 0, locked, lock_limit;
> +
> +	if (!current || !current->mm)
> +		return -ESRCH; /* process exited */
> +
> +	down_write(&current->mm->mmap_sem);
> +	locked = current->mm->locked_vm + npages;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +		pr_warn("[%d] RLIMIT_MEMLOCK (%ld) exceeded\n",
> +				current->pid, rlimit(RLIMIT_MEMLOCK));
> +		ret = -ENOMEM;
> +	} else {
> +		current->mm->locked_vm += npages;
> +	}
> +	pr_debug("[%d] RLIMIT_MEMLOCK+ %ld pages\n", current->pid,
> +			current->mm->locked_vm);
> +	up_write(&current->mm->mmap_sem);
> +
> +	return ret;
> +}
> +
> +static void decrement_locked_vm(long npages)
> +{
> +	if (!current || !current->mm)
> +		return; /* process exited */
> +
> +	down_write(&current->mm->mmap_sem);
> +	if (npages > current->mm->locked_vm)
> +		npages = current->mm->locked_vm;
> +	current->mm->locked_vm -= npages;
> +	pr_debug("[%d] RLIMIT_MEMLOCK- %ld pages\n", current->pid,
> +			current->mm->locked_vm);
> +	up_write(&current->mm->mmap_sem);
> +}
> +
>  /*
>   * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
>   *
> @@ -66,8 +107,6 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> -	unsigned long locked, lock_limit, npages;
> -	struct iommu_table *tbl = container->tbl;
>  
>  	if (!container->tbl)
>  		return -ENXIO;
> @@ -95,21 +134,19 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
>  	 * that would effectively kill the guest at random points, much better
>  	 * enforcing the limit based on the max that the guest can map.
> +	 *
> +	 * Unfortunately at the moment it counts whole tables, no matter how
> +	 * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
> +	 * each with 2GB DMA window, 8GB will be counted here. The reason for
> +	 * this is that we cannot tell here the amount of RAM used by the guest
> +	 * as this information is only available from KVM and VFIO is
> +	 * KVM agnostic.
>  	 */
> -	down_write(&current->mm->mmap_sem);
> -	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
> -	locked = current->mm->locked_vm + npages;
> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> -		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> -				rlimit(RLIMIT_MEMLOCK));
> -		ret = -ENOMEM;
> -	} else {
> +	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
> +	if (ret)
> +		return ret;
>  
> -		current->mm->locked_vm += npages;
> -		container->enabled = true;
> -	}
> -	up_write(&current->mm->mmap_sem);
> +	container->enabled = true;
>  
>  	return ret;
>  }
> @@ -124,10 +161,7 @@ static void tce_iommu_disable(struct tce_container *container)
>  	if (!container->tbl || !current->mm)
>  		return;
>  
> -	down_write(&current->mm->mmap_sem);
> -	current->mm->locked_vm -= (container->tbl->it_size <<
> -			container->tbl->it_page_shift) >> PAGE_SHIFT;
> -	up_write(&current->mm->mmap_sem);
> +	decrement_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)







* Re: [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
  2015-01-29  9:21   ` Alexey Kardashevskiy
@ 2015-02-03  0:12     ` Alex Williamson
  -1 siblings, 0 replies; 70+ messages in thread
From: Alex Williamson @ 2015-02-03  0:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
> Modern IBM POWERPC systems support multiple (currently two) TCE tables
> per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
> for TCE tables. Right now just one table is supported.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h            |  18 ++--
>  arch/powerpc/kernel/eeh.c                   |   2 +-
>  arch/powerpc/kernel/iommu.c                 |  34 ++++----
>  arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>  arch/powerpc/platforms/powernv/pci.h        |   4 +-
>  arch/powerpc/platforms/pseries/iommu.c      |   9 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
>  9 files changed, 170 insertions(+), 83 deletions(-)
[snip]
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 29d5708..28909e1 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
>   */
>  struct tce_container {
>  	struct mutex lock;
> -	struct iommu_table *tbl;
> +	struct iommu_group *grp;
>  	bool enabled;
>  };
>  
> @@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
>  	return false;
>  }
>  
> +static struct iommu_table *spapr_tce_find_table(
> +		struct tce_container *container,
> +		phys_addr_t ioba)
> +{
> +	long i;
> +	struct iommu_table *ret = NULL;
> +	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
> +
> +	mutex_lock(&container->lock);
> +	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
> +		struct iommu_table *tbl = &iommu->tables[i];
> +		unsigned long entry = ioba >> tbl->it_page_shift;
> +		unsigned long start = tbl->it_offset;
> +		unsigned long end = start + tbl->it_size;
> +
> +		if ((start <= entry) && (entry < end)) {
> +			ret = tbl;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&container->lock);
> +
> +	return ret;
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> +	struct powerpc_iommu *iommu;
> +	struct iommu_table *tbl;
>  
> -	if (!container->tbl)
> +	if (!container->grp)
>  		return -ENXIO;
>  
> -	if (!current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if (container->enabled)
>  		return -EBUSY;
>  
> @@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 * as this information is only available from KVM and VFIO is
>  	 * KVM agnostic.
>  	 */
> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
> +	iommu = iommu_group_get_iommudata(container->grp);
> +	if (!iommu)
> +		return -EFAULT;
> +
> +	tbl = &iommu->tables[0];


There should probably be a comment somewhere documenting that tables[0]
is the small window and presumably [1] will be the DDW.




* Re: [PATCH v3 24/24] vfio: powerpc/spapr: Support Dynamic DMA windows
  2015-01-29  9:22   ` Alexey Kardashevskiy
@ 2015-02-03  2:53     ` Alex Williamson
  -1 siblings, 0 replies; 70+ messages in thread
From: Alex Williamson @ 2015-02-03  2:53 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, 2015-01-29 at 20:22 +1100, Alexey Kardashevskiy wrote:
> This adds create/remove window ioctls to create and remove DMA windows.
> 
> This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
> information such as the number of supported windows and the maximum
> number of levels of TCE tables.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/include/asm/iommu.h    |   2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c | 137 +++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/vfio.h           |  24 ++++++-
>  3 files changed, 160 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 33009f9..7ca1c8c 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -133,7 +133,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
>  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>  					    int nid);
>  
> -#define POWERPC_IOMMU_MAX_TABLES	1
> +#define POWERPC_IOMMU_MAX_TABLES	2
>  
>  #define POWERPC_IOMMU_DEFAULT_LEVELS	1
>  
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 8bcafb7..d3a1cc9 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -300,6 +300,20 @@ static struct iommu_table *spapr_tce_find_table(
>  	return ret;
>  }
>  
> +static int spapr_tce_find_free_table(struct tce_container *container)
> +{
> +	int i;
> +
> +	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
> +		struct iommu_table *tbl = &container->tables[i];
> +
> +		if (!tbl->it_size)
> +			return i;
> +	}
> +
> +	return -1;
> +}
> +
>  static unsigned long tce_default_winsize(struct tce_container *container)
>  {
>  	struct tce_iommu_group *tcegrp;
> @@ -594,7 +608,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>  				 unsigned int cmd, unsigned long arg)
>  {
>  	struct tce_container *container = iommu_data;
> -	unsigned long minsz;
> +	unsigned long minsz, ddwsz;
>  	long ret;
>  
>  	switch (cmd) {
> @@ -636,6 +650,15 @@ static long tce_iommu_ioctl(void *iommu_data,
>  
>  		info.dma32_window_start = iommu->tce32_start;
>  		info.dma32_window_size = iommu->tce32_size;
> +		info.windows_supported = iommu->windows_supported;
> +		info.levels = iommu->levels;
> +		info.flags = iommu->flags;
> +
> +		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> +				levels);
> +
> +		if (info.argsz == ddwsz)
> +			minsz = ddwsz;
>  
>  		if (copy_to_user((void __user *)arg, &info, minsz))
>  			return -EFAULT;
> @@ -800,6 +823,118 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		return ret;
>  	}
>  
> +	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
> +		struct vfio_iommu_spapr_tce_create create;
> +		struct powerpc_iommu *iommu;
> +		struct tce_iommu_group *tcegrp;
> +		int num;
> +
> +		if (!tce_preregistered(container))
> +			return -ENXIO;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
> +				start_addr);
> +
> +		if (copy_from_user(&create, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (create.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (create.flags)
> +			return -EINVAL;
> +
> +		num = spapr_tce_find_free_table(container);
> +		if (num < 0)
> +			return -ENOSYS;
> +
> +		tcegrp = list_first_entry(&container->group_list,
> +				struct tce_iommu_group, next);
> +		iommu = iommu_group_get_iommudata(tcegrp->grp);
> +
> +		ret = iommu->ops->create_table(iommu, num,
> +				create.page_shift, create.window_shift,
> +				create.levels,
> +				&container->tables[num]);
> +		if (ret)
> +			return ret;
> +
> +		list_for_each_entry(tcegrp, &container->group_list, next) {
> +			struct powerpc_iommu *iommutmp =
> +					iommu_group_get_iommudata(tcegrp->grp);
> +
> +			if (WARN_ON_ONCE(iommutmp->ops != iommu->ops))
> +				return -EFAULT;
> +
> +			ret = iommu->ops->set_window(iommutmp, num,
> +					&container->tables[num]);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		create.start_addr =
> +				container->tables[num].it_offset <<
> +				container->tables[num].it_page_shift;
> +
> +		if (copy_to_user((void __user *)arg, &create, minsz))
> +			return -EFAULT;
> +
> +		mutex_lock(&container->lock);
> +		mutex_unlock(&container->lock);

Huh?

> +
> +		return ret;
> +	}
> +	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
> +		struct vfio_iommu_spapr_tce_remove remove;
> +		struct powerpc_iommu *iommu = NULL;
> +		struct iommu_table *tbl;
> +		struct tce_iommu_group *tcegrp;
> +		int num;
> +
> +		if (!tce_preregistered(container))
> +			return -ENXIO;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
> +				start_addr);
> +
> +		if (copy_from_user(&remove, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (remove.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (remove.flags)
> +			return -EINVAL;
> +
> +
> +		tbl = spapr_tce_find_table(container, remove.start_addr);
> +		if (!tbl)
> +			return -EINVAL;
> +
> +		/* Detach windows from IOMMUs */
> +		mutex_lock(&container->lock);
> +
> +		/* Detach groups from IOMMUs */
> +		num = tbl - container->tables;
> +		list_for_each_entry(tcegrp, &container->group_list, next) {
> +			iommu = iommu_group_get_iommudata(tcegrp->grp);
> +			if (container->tables[num].it_size)
> +				iommu->ops->unset_window(iommu, num);
> +		}
> +
> +		/* Free table */
> +		tcegrp = list_first_entry(&container->group_list,
> +				struct tce_iommu_group, next);
> +		iommu = iommu_group_get_iommudata(tcegrp->grp);
> +
> +		tce_iommu_clear(container, tbl,
> +				tbl->it_offset, tbl->it_size);
> +		iommu->ops->free_table(tbl);
> +
> +		mutex_unlock(&container->lock);
> +
> +		return 0;
> +	}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2bb0c9b..7ed7000 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -483,9 +483,11 @@ struct vfio_iommu_type1_unregister_memory {
>   */
>  struct vfio_iommu_spapr_tce_info {
>  	__u32 argsz;
> -	__u32 flags;			/* reserved for future use */
> +	__u32 flags;

So what are the flags?

>  	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
>  	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
> +	__u32 windows_supported;
> +	__u32 levels;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> @@ -521,6 +523,26 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +struct vfio_iommu_spapr_tce_create {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u32 page_shift;
> +	__u32 window_shift;
> +	__u32 levels;
> +	/* out */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
> +struct vfio_iommu_spapr_tce_remove {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> +


Nak, -EINSUFFICIENTDOCS

>  /* ***************************************************************** */
>  
>  #endif /* _UAPIVFIO_H */





>  			return -EFAULT;
> @@ -800,6 +823,118 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		return ret;
>  	}
>  
> +	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
> +		struct vfio_iommu_spapr_tce_create create;
> +		struct powerpc_iommu *iommu;
> +		struct tce_iommu_group *tcegrp;
> +		int num;
> +
> +		if (!tce_preregistered(container))
> +			return -ENXIO;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
> +				start_addr);
> +
> +		if (copy_from_user(&create, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (create.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (create.flags)
> +			return -EINVAL;
> +
> +		num = spapr_tce_find_free_table(container);
> +		if (num < 0)
> +			return -ENOSYS;
> +
> +		tcegrp = list_first_entry(&container->group_list,
> +				struct tce_iommu_group, next);
> +		iommu = iommu_group_get_iommudata(tcegrp->grp);
> +
> +		ret = iommu->ops->create_table(iommu, num,
> +				create.page_shift, create.window_shift,
> +				create.levels,
> +				&container->tables[num]);
> +		if (ret)
> +			return ret;
> +
> +		list_for_each_entry(tcegrp, &container->group_list, next) {
> +			struct powerpc_iommu *iommutmp =
> +					iommu_group_get_iommudata(tcegrp->grp);
> +
> +			if (WARN_ON_ONCE(iommutmp->ops != iommu->ops))
> +				return -EFAULT;
> +
> +			ret = iommu->ops->set_window(iommutmp, num,
> +					&container->tables[num]);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		create.start_addr =
> +				container->tables[num].it_offset <<
> +				container->tables[num].it_page_shift;
> +
> +		if (copy_to_user((void __user *)arg, &create, minsz))
> +			return -EFAULT;
> +
> +		mutex_lock(&container->lock);
> +		mutex_unlock(&container->lock);

Huh?

> +
> +		return ret;
> +	}
> +	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
> +		struct vfio_iommu_spapr_tce_remove remove;
> +		struct powerpc_iommu *iommu = NULL;
> +		struct iommu_table *tbl;
> +		struct tce_iommu_group *tcegrp;
> +		int num;
> +
> +		if (!tce_preregistered(container))
> +			return -ENXIO;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
> +				start_addr);
> +
> +		if (copy_from_user(&remove, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (remove.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (remove.flags)
> +			return -EINVAL;
> +
> +
> +		tbl = spapr_tce_find_table(container, remove.start_addr);
> +		if (!tbl)
> +			return -EINVAL;
> +
> +		/* Detach windows from IOMMUs */
> +		mutex_lock(&container->lock);
> +
> +		/* Detach groups from IOMMUs */
> +		num = tbl - container->tables;
> +		list_for_each_entry(tcegrp, &container->group_list, next) {
> +			iommu = iommu_group_get_iommudata(tcegrp->grp);
> +			if (container->tables[num].it_size)
> +				iommu->ops->unset_window(iommu, num);
> +		}
> +
> +		/* Free table */
> +		tcegrp = list_first_entry(&container->group_list,
> +				struct tce_iommu_group, next);
> +		iommu = iommu_group_get_iommudata(tcegrp->grp);
> +
> +		tce_iommu_clear(container, tbl,
> +				tbl->it_offset, tbl->it_size);
> +		iommu->ops->free_table(tbl);
> +
> +		mutex_unlock(&container->lock);
> +
> +		return 0;
> +	}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2bb0c9b..7ed7000 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -483,9 +483,11 @@ struct vfio_iommu_type1_unregister_memory {
>   */
>  struct vfio_iommu_spapr_tce_info {
>  	__u32 argsz;
> -	__u32 flags;			/* reserved for future use */
> +	__u32 flags;

So what are the flags?

>  	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
>  	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
> +	__u32 windows_supported;
> +	__u32 levels;
>  };
>  
>  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> @@ -521,6 +523,26 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +struct vfio_iommu_spapr_tce_create {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u32 page_shift;
> +	__u32 window_shift;
> +	__u32 levels;
> +	/* out */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
> +struct vfio_iommu_spapr_tce_remove {
> +	__u32 argsz;
> +	__u32 flags;
> +	/* in */
> +	__u64 start_addr;
> +};
> +#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> +


Nak, -EINSUFFICIENTDOCS

>  /* ***************************************************************** */
>  
>  #endif /* _UAPIVFIO_H */

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 14/24] vfio: powerpc/spapr: Register memory
  2015-02-03  0:11     ` Alex Williamson
@ 2015-02-03  5:51       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-02-03  5:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Graf, Alexander Gordeev,
	linux-kernel

On 02/03/2015 11:11 AM, Alex Williamson wrote:
> On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
>> The existing implementation accounts the whole DMA window in
>> the locked_vm counter which is going to be even worse with multiple
>> containers and huge DMA windows.
>>
>> This introduces 2 ioctls to register/unregister DMA memory which
>> receive user space address and size of the memory region which
>> needs to be pinned/unpinned and counted in locked_vm.
>>
>> If any memory region was registered, all subsequent DMA map requests
>> should address already pinned memory. If no memory was registered,
>> then the amount of memory required for a single default window will be
>> accounted when the container is enabled and every map/unmap will pin/unpin
>> a page.
>>
>> Dynamic DMA window and in-kernel acceleration will require memory to
>> be registered in order to work.
>>
>> The accounting is done per VFIO container. When the support of
>> multiple groups per container is added, we will have accurate locked_vm
>> accounting.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 333 ++++++++++++++++++++++++++++++++----
>>  include/uapi/linux/vfio.h           |  29 ++++
>>  2 files changed, 331 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 8256275..d0987ae 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -86,8 +86,169 @@ struct tce_container {
>>  	struct mutex lock;
>>  	struct iommu_group *grp;
>>  	bool enabled;
>> +	struct list_head mem_list;
>>  };
>>  
>> +struct tce_memory {
>> +	struct list_head next;
>> +	struct rcu_head rcu;
>> +	__u64 vaddr;
>> +	__u64 size;
>> +	__u64 pfns[];
>> +};
> 
> So we're using 2MB of kernel memory per 1G of user mapped memory, right?
> Or are we using bigger pages here?  I'm not sure the kmalloc below is
> the appropriate allocator for something that can be so large.

ok, vmalloc it is then.


>> +
>> +static void tce_unpin_pages(struct tce_container *container,
>> +		struct tce_memory *mem, __u64 vaddr, __u64 size)
>> +{
>> +	__u64 off;
>> +	struct page *page = NULL;
>> +
>> +
>> +	for (off = 0; off < size; off += PAGE_SIZE) {
>> +		if (!mem->pfns[off >> PAGE_SHIFT])
>> +			continue;
>> +
>> +		page = pfn_to_page(mem->pfns[off >> PAGE_SHIFT]);
>> +		if (!page)
>> +			continue;
>> +
>> +		put_page(page);
>> +		mem->pfns[off >> PAGE_SHIFT] = 0;
>> +	}
> 
> Seems cleaner to count by 1 rather than PAGE_SIZE (ie. shift size once
> instead of off 3 times).
>
>> +}
>> +
>> +static void release_tce_memory(struct rcu_head *head)
>> +{
>> +	struct tce_memory *mem = container_of(head, struct tce_memory, rcu);
>> +
>> +	kfree(mem);
>> +}
>> +
>> +static void tce_do_unregister_pages(struct tce_container *container,
>> +		struct tce_memory *mem)
>> +{
>> +	tce_unpin_pages(container, mem, mem->vaddr, mem->size);
>> +	decrement_locked_vm(mem->size);
>> +	list_del_rcu(&mem->next);
>> +	call_rcu_sched(&mem->rcu, release_tce_memory);
>> +}
>> +
>> +static long tce_unregister_pages(struct tce_container *container,
>> +		__u64 vaddr, __u64 size)
>> +{
>> +	struct tce_memory *mem, *memtmp;
>> +
>> +	if (container->enabled)
>> +		return -EBUSY;
>> +
>> +	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>> +		return -EINVAL;
>> +
>> +	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next) {
>> +		if ((mem->vaddr == vaddr) && (mem->size == size)) {
>> +			tce_do_unregister_pages(container, mem);
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return -ENOENT;
>> +}
>> +
>> +static long tce_pin_pages(struct tce_container *container,
>> +		struct tce_memory *mem, __u64 vaddr, __u64 size)
>> +{
>> +	__u64 off;
>> +	struct page *page = NULL;
>> +
>> +	for (off = 0; off < size; off += PAGE_SIZE) {
>> +		if (1 != get_user_pages_fast(vaddr + off,
>> +					1/* pages */, 1/* iswrite */, &page)) {
>> +			tce_unpin_pages(container, mem, vaddr, off);
>> +			return -EFAULT;
>> +		}
>> +
>> +		mem->pfns[off >> PAGE_SHIFT] = page_to_pfn(page);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static long tce_register_pages(struct tce_container *container,
>> +		__u64 vaddr, __u64 size)
>> +{
>> +	long ret;
>> +	struct tce_memory *mem;
>> +
>> +	if (container->enabled)
>> +		return -EBUSY;
>> +
>> +	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
>> +			((vaddr + size) < vaddr))
>> +		return -EINVAL;
>> +
>> +	/* Any overlap with registered chunks? */
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
>> +		if ((mem->vaddr < (vaddr + size)) &&
>> +				(vaddr < (mem->vaddr + mem->size))) {
>> +			ret = -EBUSY;
>> +			goto unlock_exit;
>> +		}
>> +	}
>> +
>> +	ret = try_increment_locked_vm(size >> PAGE_SHIFT);
>> +	if (ret)
>> +		goto unlock_exit;
>> +
>> +	mem = kzalloc(sizeof(*mem) + (size >> (PAGE_SHIFT - 3)), GFP_KERNEL);
> 
> 
> I suspect that userspace can break kmalloc with the potential size of
> this structure.  You might need a vmalloc.  I also wonder if there isn't
> a more efficient tree structure to use.


Right, I'll use vmalloc. All this time I was thinking kmalloc() allocated
non-contiguous memory :) How would a tree be more efficient here? I store
the pfns once and unpin them once as well.



>> +	if (!mem)
>> +		goto unlock_exit;
>> +
>> +	if (tce_pin_pages(container, mem, vaddr, size))
>> +		goto free_exit;
>> +
>> +	mem->vaddr = vaddr;
>> +	mem->size = size;
>> +
>> +	list_add_rcu(&mem->next, &container->mem_list);
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +
>> +free_exit:
>> +	kfree(mem);
>> +
>> +unlock_exit:
>> +	decrement_locked_vm(size >> PAGE_SHIFT);
>> +	rcu_read_unlock();
>> +
>> +	return ret;
>> +}
>> +
>> +static inline bool tce_preregistered(struct tce_container *container)
>> +{
>> +	return !list_empty(&container->mem_list);
>> +}
>> +
>> +static bool tce_pinned(struct tce_container *container,
>> +		__u64 vaddr, __u64 size)
>> +{
>> +	struct tce_memory *mem;
>> +	bool ret = false;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
>> +		if ((mem->vaddr <= vaddr) &&
>> +				(vaddr + size <= mem->vaddr + mem->size)) {
>> +			ret = true;
>> +			break;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return ret;
>> +}
>> +
>>  static bool tce_check_page_size(struct page *page, unsigned page_shift)
>>  {
>>  	unsigned shift;
>> @@ -166,14 +327,16 @@ static int tce_iommu_enable(struct tce_container *container)
>>  	 * as this information is only available from KVM and VFIO is
>>  	 * KVM agnostic.
>>  	 */
>> -	iommu = iommu_group_get_iommudata(container->grp);
>> -	if (!iommu)
>> -		return -EFAULT;
>> +	if (!tce_preregistered(container)) {
>> +		iommu = iommu_group_get_iommudata(container->grp);
>> +		if (!iommu)
>> +			return -EFAULT;
>>  
>> -	tbl = &iommu->tables[0];
>> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
>> -	if (ret)
>> -		return ret;
>> +		tbl = &iommu->tables[0];
>> +		ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(tbl));
>> +		if (ret)
>> +			return ret;
>> +	}
>>  
>>  	container->enabled = true;
>>  
>> @@ -193,12 +356,14 @@ static void tce_iommu_disable(struct tce_container *container)
>>  	if (!container->grp || !current->mm)
>>  		return;
>>  
>> -	iommu = iommu_group_get_iommudata(container->grp);
>> -	if (!iommu)
>> -		return;
>> +	if (!tce_preregistered(container)) {
>> +		iommu = iommu_group_get_iommudata(container->grp);
>> +		if (!iommu)
>> +			return;
>>  
>> -	tbl = &iommu->tables[0];
>> -	decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
>> +		tbl = &iommu->tables[0];
>> +		decrement_locked_vm(IOMMU_TABLE_PAGES(tbl));
>> +	}
>>  }
>>  
>>  static void *tce_iommu_open(unsigned long arg)
>> @@ -215,6 +380,7 @@ static void *tce_iommu_open(unsigned long arg)
>>  		return ERR_PTR(-ENOMEM);
>>  
>>  	mutex_init(&container->lock);
>> +	INIT_LIST_HEAD_RCU(&container->mem_list);
>>  
>>  	return container;
>>  }
>> @@ -222,6 +388,7 @@ static void *tce_iommu_open(unsigned long arg)
>>  static void tce_iommu_release(void *iommu_data)
>>  {
>>  	struct tce_container *container = iommu_data;
>> +	struct tce_memory *mem, *memtmp;
>>  
>>  	WARN_ON(container->grp);
>>  	tce_iommu_disable(container);
>> @@ -229,14 +396,19 @@ static void tce_iommu_release(void *iommu_data)
>>  	if (container->grp)
>>  		tce_iommu_detach_group(iommu_data, container->grp);
>>  
>> +	list_for_each_entry_safe(mem, memtmp, &container->mem_list, next)
>> +		tce_do_unregister_pages(container, mem);
>> +
>>  	mutex_destroy(&container->lock);
>>  
>>  	kfree(container);
>>  }
>>  
>> -static void tce_iommu_unuse_page(unsigned long oldtce)
>> +static void tce_iommu_unuse_page(struct tce_container *container,
>> +		unsigned long oldtce)
>>  {
>>  	struct page *page;
>> +	bool do_put = !tce_preregistered(container);
>>  
>>  	if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
>>  		return;
>> @@ -245,7 +417,8 @@ static void tce_iommu_unuse_page(unsigned long oldtce)
>>  	if (oldtce & TCE_PCI_WRITE)
>>  		SetPageDirty(page);
>>  
>> -	put_page(page);
>> +	if (do_put)
>> +		put_page(page);
>>  }
>>  
>>  static int tce_iommu_clear(struct tce_container *container,
>> @@ -261,7 +434,7 @@ static int tce_iommu_clear(struct tce_container *container,
>>  		if (ret)
>>  			continue;
>>  
>> -		tce_iommu_unuse_page(oldtce);
>> +		tce_iommu_unuse_page(container, oldtce);
>>  	}
>>  
>>  	return 0;
>> @@ -279,42 +452,91 @@ static enum dma_data_direction tce_iommu_direction(unsigned long tce)
>>  		return DMA_NONE;
>>  }
>>  
>> +static unsigned long tce_get_hva_cached(struct tce_container *container,
>> +		unsigned page_shift, unsigned long tce)
>> +{
>> +	struct tce_memory *mem;
>> +	struct page *page = NULL;
>> +	unsigned long hva = -1;
>> +
>> +	tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
>> +		if ((mem->vaddr <= tce) && (tce < (mem->vaddr + mem->size))) {
>> +			unsigned long gfn = (tce - mem->vaddr) >> PAGE_SHIFT;
>> +			unsigned long hpa = mem->pfns[gfn] << PAGE_SHIFT;
>> +
>> +			page = pfn_to_page(mem->pfns[gfn]);
>> +
>> +			if (!tce_check_page_size(page, page_shift))
>> +				break;
>> +
>> +			hva = (unsigned long) __va(hpa);
>> +			break;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return hva;
>> +}
>> +
>> +static unsigned long tce_get_hva(struct tce_container *container,
>> +		unsigned page_shift, unsigned long tce)
>> +{
>> +	long ret = 0;
>> +	struct page *page = NULL;
>> +	unsigned long hva = -1;
>> +	enum dma_data_direction direction = tce_iommu_direction(tce);
>> +
>> +	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
>> +			direction != DMA_TO_DEVICE, &page);
>> +	if (unlikely(ret != 1))
>> +		return -1;
>> +
>> +	if (!tce_check_page_size(page, page_shift)) {
>> +		put_page(page);
>> +		return -1;
>> +	}
>> +
>> +	hva = (unsigned long) page_address(page) +
>> +		(tce & ~((1ULL << page_shift) - 1) & ~PAGE_MASK);
>> +
>> +	return hva;
>> +}
>> +
>>  static long tce_iommu_build(struct tce_container *container,
>>  		struct iommu_table *tbl,
>>  		unsigned long entry, unsigned long tce, unsigned long pages)
>>  {
>>  	long i, ret = 0;
>> -	struct page *page = NULL;
>>  	unsigned long hva, oldtce;
>>  	enum dma_data_direction direction = tce_iommu_direction(tce);
>> +	bool do_put = false;
>>  
>>  	for (i = 0; i < pages; ++i) {
>> -		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
>> -				direction != DMA_TO_DEVICE, &page);
>> -		if (unlikely(ret != 1)) {
>> -			ret = -EFAULT;
>> -			break;
>> +		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
>> +		if (hva == -1) {
>> +			do_put = true;
>> +			WARN_ON_ONCE(1);
>> +			hva = tce_get_hva(container, tbl->it_page_shift, tce);
>>  		}
>>  
>> -		if (!tce_check_page_size(page, tbl->it_page_shift)) {
>> -			ret = -EFAULT;
>> -			break;
>> -		}
>> -
>> -		hva = (unsigned long) page_address(page) +
>> -			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>>  		oldtce = 0;
>> -
>>  		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
>>  		if (ret) {
>> -			put_page(page);
>> +			if (do_put)
>> +				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>>  					__func__, entry << tbl->it_page_shift,
>>  					tce, ret);
>>  			break;
>>  		}
>>  
>> -		tce_iommu_unuse_page(oldtce);
>> +		if (do_put)
>> +			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>> +
>> +		tce_iommu_unuse_page(container, oldtce);
>> +
>>  		tce += IOMMU_PAGE_SIZE(tbl);
>>  	}
>>  
>> @@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		if (ret)
>>  			return ret;
>>  
>> +		/* If any memory is pinned, only allow pages from that region */
>> +		if (tce_preregistered(container) &&
>> +				!tce_pinned(container, param.vaddr, param.size))
>> +			return -EPERM;
>> +
>>  		ret = tce_iommu_build(container, tbl,
>>  				param.iova >> tbl->it_page_shift,
>>  				tce, param.size >> tbl->it_page_shift);
>> @@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  
>>  		return ret;
>>  	}
>> +	case VFIO_IOMMU_REGISTER_MEMORY: {
>> +		struct vfio_iommu_type1_register_memory param;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
>> +				size);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		/* No flag is supported now */
>> +		if (param.flags)
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&container->lock);
>> +		ret = tce_register_pages(container, param.vaddr, param.size);
>> +		mutex_unlock(&container->lock);
>> +
>> +		return ret;
>> +	}
>> +	case VFIO_IOMMU_UNREGISTER_MEMORY: {
>> +		struct vfio_iommu_type1_unregister_memory param;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
>> +				size);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		/* No flag is supported now */
>> +		if (param.flags)
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&container->lock);
>> +		tce_unregister_pages(container, param.vaddr, param.size);
>> +		mutex_unlock(&container->lock);
>> +
>> +		return 0;
>> +	}
>>  	case VFIO_IOMMU_ENABLE:
>>  		mutex_lock(&container->lock);
>>  		ret = tce_iommu_enable(container);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 29715d2..2bb0c9b 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
>>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +/**
>> + * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
>> + *
>> + * Registers user space memory where DMA is allowed. It pins
>> + * user pages and does the locked memory accounting so
>> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
>> + * get simpler.
>> + */
>> +struct vfio_iommu_type1_register_memory {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__u64	vaddr;				/* Process virtual address */
>> +	__u64	size;				/* Size of mapping (bytes) */
>> +};
>> +#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
>> +/**
>> + * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
>> + *
>> + * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
>> + */
>> +struct vfio_iommu_type1_unregister_memory {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__u64	vaddr;				/* Process virtual address */
>> +	__u64	size;				/* Size of mapping (bytes) */
>> +};
>> +#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
>> +
> 
> Is the user allowed to unregister arbitrary sub-regions of previously
> registered memory?  (I think I know the answer, but it should be
> documented)

The answer is "no" :) I'll update Documentation/vfio.txt.


> Why are these "type1" structures, shouldn't they be down below?


Pretty much because these do not look like they do anything
powerpc-specific from the userspace perspective, much like DMA map/unmap.


> Do we need an extension or flag bit to describe these as present or is
> it sufficient to call and fail?

Sorry, I do not follow you here. A flag to describe what as present? As it is
now, in QEMU I set up a memory listener which walks through all RAM regions
and calls VFIO_IOMMU_REGISTER_MEMORY for every slot, once when the
container starts being used, and I expect that this can fail (because of
RLIMIT, etc).


> Do we need two ioctls or one?

There are map/unmap, enable/disable, set/unset container pairs, so I
thought it would look natural if this was a pin/unpin pair, no?


> What about Documentation/vfio.txt?

Yep. Thanks for the review!


> 
>>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>  
>>  /*
> 
> 
> 


-- 
Alexey

>>  		return DMA_NONE;
>>  }
>>  
>> +static unsigned long tce_get_hva_cached(struct tce_container *container,
>> +		unsigned page_shift, unsigned long tce)
>> +{
>> +	struct tce_memory *mem;
>> +	struct page *page = NULL;
>> +	unsigned long hva = -1;
>> +
>> +	tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(mem, &container->mem_list, next) {
>> +		if ((mem->vaddr <= tce) && (tce < (mem->vaddr + mem->size))) {
>> +			unsigned long gfn = (tce - mem->vaddr) >> PAGE_SHIFT;
>> +			unsigned long hpa = mem->pfns[gfn] << PAGE_SHIFT;
>> +
>> +			page = pfn_to_page(mem->pfns[gfn]);
>> +
>> +			if (!tce_check_page_size(page, page_shift))
>> +				break;
>> +
>> +			hva = (unsigned long) __va(hpa);
>> +			break;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return hva;
>> +}
>> +
>> +static unsigned long tce_get_hva(struct tce_container *container,
>> +		unsigned page_shift, unsigned long tce)
>> +{
>> +	long ret = 0;
>> +	struct page *page = NULL;
>> +	unsigned long hva = -1;
>> +	enum dma_data_direction direction = tce_iommu_direction(tce);
>> +
>> +	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
>> +			direction != DMA_TO_DEVICE, &page);
>> +	if (unlikely(ret != 1))
>> +		return -1;
>> +
>> +	if (!tce_check_page_size(page, page_shift)) {
>> +		put_page(page);
>> +		return -1;
>> +	}
>> +
>> +	hva = (unsigned long) page_address(page) +
>> +		(tce & ~((1ULL << page_shift) - 1) & ~PAGE_MASK);
>> +
>> +	return hva;
>> +}
>> +
>>  static long tce_iommu_build(struct tce_container *container,
>>  		struct iommu_table *tbl,
>>  		unsigned long entry, unsigned long tce, unsigned long pages)
>>  {
>>  	long i, ret = 0;
>> -	struct page *page = NULL;
>>  	unsigned long hva, oldtce;
>>  	enum dma_data_direction direction = tce_iommu_direction(tce);
>> +	bool do_put = false;
>>  
>>  	for (i = 0; i < pages; ++i) {
>> -		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
>> -				direction != DMA_TO_DEVICE, &page);
>> -		if (unlikely(ret != 1)) {
>> -			ret = -EFAULT;
>> -			break;
>> +		hva = tce_get_hva_cached(container, tbl->it_page_shift, tce);
>> +		if (hva == -1) {
>> +			do_put = true;
>> +			WARN_ON_ONCE(1);
>> +			hva = tce_get_hva(container, tbl->it_page_shift, tce);
>>  		}
>>  
>> -		if (!tce_check_page_size(page, tbl->it_page_shift)) {
>> -			ret = -EFAULT;
>> -			break;
>> -		}
>> -
>> -		hva = (unsigned long) page_address(page) +
>> -			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>>  		oldtce = 0;
>> -
>>  		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
>>  		if (ret) {
>> -			put_page(page);
>> +			if (do_put)
>> +				put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>>  			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
>>  					__func__, entry << tbl->it_page_shift,
>>  					tce, ret);
>>  			break;
>>  		}
>>  
>> -		tce_iommu_unuse_page(oldtce);
>> +		if (do_put)
>> +			put_page(pfn_to_page(__pa(hva) >> PAGE_SHIFT));
>> +
>> +		tce_iommu_unuse_page(container, oldtce);
>> +
>>  		tce += IOMMU_PAGE_SIZE(tbl);
>>  	}
>>  
>> @@ -416,6 +638,11 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		if (ret)
>>  			return ret;
>>  
>> +		/* If any memory is pinned, only allow pages from that region */
>> +		if (tce_preregistered(container) &&
>> +				!tce_pinned(container, param.vaddr, param.size))
>> +			return -EPERM;
>> +
>>  		ret = tce_iommu_build(container, tbl,
>>  				param.iova >> tbl->it_page_shift,
>>  				tce, param.size >> tbl->it_page_shift);
>> @@ -464,6 +691,50 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  
>>  		return ret;
>>  	}
>> +	case VFIO_IOMMU_REGISTER_MEMORY: {
>> +		struct vfio_iommu_type1_register_memory param;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_register_memory,
>> +				size);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		/* No flag is supported now */
>> +		if (param.flags)
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&container->lock);
>> +		ret = tce_register_pages(container, param.vaddr, param.size);
>> +		mutex_unlock(&container->lock);
>> +
>> +		return ret;
>> +	}
>> +	case VFIO_IOMMU_UNREGISTER_MEMORY: {
>> +		struct vfio_iommu_type1_unregister_memory param;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_unregister_memory,
>> +				size);
>> +
>> +		if (copy_from_user(&param, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (param.argsz < minsz)
>> +			return -EINVAL;
>> +
>> +		/* No flag is supported now */
>> +		if (param.flags)
>> +			return -EINVAL;
>> +
>> +		mutex_lock(&container->lock);
>> +		tce_unregister_pages(container, param.vaddr, param.size);
>> +		mutex_unlock(&container->lock);
>> +
>> +		return 0;
>> +	}
>>  	case VFIO_IOMMU_ENABLE:
>>  		mutex_lock(&container->lock);
>>  		ret = tce_iommu_enable(container);
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 29715d2..2bb0c9b 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -437,6 +437,35 @@ struct vfio_iommu_type1_dma_unmap {
>>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>  
>> +/**
>> + * VFIO_IOMMU_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_type1_register_memory)
>> + *
>> + * Registers user space memory where DMA is allowed. It pins
>> + * user pages and does the locked memory accounting so
>> + * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
>> + * get simpler.
>> + */
>> +struct vfio_iommu_type1_register_memory {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__u64	vaddr;				/* Process virtual address */
>> +	__u64	size;				/* Size of mapping (bytes) */
>> +};
>> +#define VFIO_IOMMU_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
>> +
>> +/**
>> + * VFIO_IOMMU_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_type1_unregister_memory)
>> + *
>> + * Unregisters user space memory registered with VFIO_IOMMU_REGISTER_MEMORY.
>> + */
>> +struct vfio_iommu_type1_unregister_memory {
>> +	__u32	argsz;
>> +	__u32	flags;
>> +	__u64	vaddr;				/* Process virtual address */
>> +	__u64	size;				/* Size of mapping (bytes) */
>> +};
>> +#define VFIO_IOMMU_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
>> +
> 
> Is the user allowed to unregister arbitrary sub-regions of previously
> registered memory?  (I think I know the answer, but it should be
> documented)

The answer is "no" :) I'll update Documentation/vfio.txt.


> Why are these "type1" structures, shouldn't they be down below?


Pretty much because these do not look like they do anything
powerpc-specific from the userspace perspective, just like DMA map/unmap.


> Do we need an extension or flag bit to describe these as present or is
> it sufficient to call and fail?

Sorry, I do not follow you here. A flag to describe what as present? As
it is now, in QEMU I set up a memory listener which walks through all RAM
regions and calls VFIO_IOMMU_REGISTER_MEMORY for every slot, once, when
the container starts being used, and I expect that this can fail (because
of RLIMIT, etc).


> Do we need two ioctls or one?

There are map/unmap, enable/disable and set/unset container pairs, so I
thought it would look natural if this was a pin/unpin pair too, no?


> What about Documentation/vfio.txt?

Yep. Thanks for the review!


> 
>>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>>  
>>  /*
> 
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 12/24] powerpc/iommu/powernv: Release replaced TCE
  2015-01-29  9:21   ` Alexey Kardashevskiy
@ 2015-02-04  6:08     ` Paul Mackerras
  -1 siblings, 0 replies; 70+ messages in thread
From: Paul Mackerras @ 2015-02-04  6:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Michael Ellerman,
	Gavin Shan, Alex Williamson, Alexander Graf, Alexander Gordeev,
	linux-kernel

On Thu, Jan 29, 2015 at 08:21:53PM +1100, Alexey Kardashevskiy wrote:
> At the moment writing new TCE value to the IOMMU table fails with EBUSY
> if there is a valid entry already. However PAPR specification allows
> the guest to write new TCE value without clearing it first.
> 
> Another problem this patch is addressing is the use of pool locks for
> external IOMMU users such as VFIO. The pool locks are to protect
> DMA page allocator rather than entries and since the host kernel does
> not control what pages are in use, there is no point in pool locks and
> exchange()+put_page(oldtce) is sufficient to avoid possible races.
> 
> This adds an exchange() callback to iommu_table_ops which does the same
> thing as set() plus it returns replaced TCE(s) so the caller can release
> the pages afterwards.
> 
> This implements exchange() for IODA2 only. This adds a requirement
> for a platform to have exchange() implemented so from now on IODA2 is
> the only supported PHB for VFIO-SPAPR.
> 
> This replaces iommu_tce_build() and iommu_clear_tce() with
> a single iommu_tce_xchg().

[snip]

> @@ -294,8 +303,9 @@ static long tce_iommu_build(struct tce_container *container,
>  
>  		hva = (unsigned long) page_address(page) +
>  			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
> +		oldtce = 0;
>  
> -		ret = iommu_tce_build(tbl, entry + 1, hva, direction);
> +		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);

Is the change from entry + 1 to entry + i here an actual bug fix?
If so please mention it in the patch description.

Paul.


* Re: [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
  2015-02-03  0:12     ` Alex Williamson
@ 2015-02-04 13:32       ` Alexander Graf
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexander Graf @ 2015-02-04 13:32 UTC (permalink / raw)
  To: Alex Williamson, Alexey Kardashevskiy
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Gordeev, linux-kernel



On 03.02.15 01:12, Alex Williamson wrote:
> On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>> per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
>> for TCE tables. Right now just one table is supported.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/include/asm/iommu.h            |  18 ++--
>>  arch/powerpc/kernel/eeh.c                   |   2 +-
>>  arch/powerpc/kernel/iommu.c                 |  34 ++++----
>>  arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
>>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
>>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>>  arch/powerpc/platforms/powernv/pci.h        |   4 +-
>>  arch/powerpc/platforms/pseries/iommu.c      |   9 +-
>>  drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
>>  9 files changed, 170 insertions(+), 83 deletions(-)
> [snip]
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 29d5708..28909e1 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
>>   */
>>  struct tce_container {
>>  	struct mutex lock;
>> -	struct iommu_table *tbl;
>> +	struct iommu_group *grp;
>>  	bool enabled;
>>  };
>>  
>> @@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
>>  	return false;
>>  }
>>  
>> +static struct iommu_table *spapr_tce_find_table(
>> +		struct tce_container *container,
>> +		phys_addr_t ioba)
>> +{
>> +	long i;
>> +	struct iommu_table *ret = NULL;
>> +	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
>> +
>> +	mutex_lock(&container->lock);
>> +	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
>> +		struct iommu_table *tbl = &iommu->tables[i];
>> +		unsigned long entry = ioba >> tbl->it_page_shift;
>> +		unsigned long start = tbl->it_offset;
>> +		unsigned long end = start + tbl->it_size;
>> +
>> +		if ((start <= entry) && (entry < end)) {
>> +			ret = tbl;
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&container->lock);
>> +
>> +	return ret;
>> +}
>> +
>>  static int tce_iommu_enable(struct tce_container *container)
>>  {
>>  	int ret = 0;
>> +	struct powerpc_iommu *iommu;
>> +	struct iommu_table *tbl;
>>  
>> -	if (!container->tbl)
>> +	if (!container->grp)
>>  		return -ENXIO;
>>  
>> -	if (!current->mm)
>> -		return -ESRCH; /* process exited */
>> -
>>  	if (container->enabled)
>>  		return -EBUSY;
>>  
>> @@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
>>  	 * as this information is only available from KVM and VFIO is
>>  	 * KVM agnostic.
>>  	 */
>> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
>> +	iommu = iommu_group_get_iommudata(container->grp);
>> +	if (!iommu)
>> +		return -EFAULT;
>> +
>> +	tbl = &iommu->tables[0];
> 
> 
> There should probably be a comment somewhere documenting that tables[0]
> is the small window and presumably [1] will be the DDW.

Rather than a comment, how about an enum?


Alex


* Re: [PATCH v3 12/24] powerpc/iommu/powernv: Release replaced TCE
  2015-02-04  6:08     ` Paul Mackerras
@ 2015-02-05  4:57       ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-02-05  4:57 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Michael Ellerman,
	Gavin Shan, Alex Williamson, Alexander Graf, Alexander Gordeev,
	linux-kernel

On 02/04/2015 05:08 PM, Paul Mackerras wrote:
> On Thu, Jan 29, 2015 at 08:21:53PM +1100, Alexey Kardashevskiy wrote:
>> At the moment writing new TCE value to the IOMMU table fails with EBUSY
>> if there is a valid entry already. However PAPR specification allows
>> the guest to write new TCE value without clearing it first.
>>
>> Another problem this patch is addressing is the use of pool locks for
>> external IOMMU users such as VFIO. The pool locks are to protect
>> DMA page allocator rather than entries and since the host kernel does
>> not control what pages are in use, there is no point in pool locks and
>> exchange()+put_page(oldtce) is sufficient to avoid possible races.
>>
>> This adds an exchange() callback to iommu_table_ops which does the same
>> thing as set() plus it returns replaced TCE(s) so the caller can release
>> the pages afterwards.
>>
>> This implements exchange() for IODA2 only. This adds a requirement
>> for a platform to have exchange() implemented so from now on IODA2 is
>> the only supported PHB for VFIO-SPAPR.
>>
>> This replaces iommu_tce_build() and iommu_clear_tce() with
>> a single iommu_tce_xchg().
> 
> [snip]
> 
>> @@ -294,8 +303,9 @@ static long tce_iommu_build(struct tce_container *container,
>>  
>>  		hva = (unsigned long) page_address(page) +
>>  			(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>> +		oldtce = 0;
>>  
>> -		ret = iommu_tce_build(tbl, entry + 1, hva, direction);
>> +		ret = iommu_tce_xchg(tbl, entry + i, hva, &oldtce, direction);
> 
> Is the change from entry + 1 to entry + i here an actual bug fix?
> If so please mention it in the patch description.

This patch added the bug:
[PATCH v3 01/24] vfio: powerpc/spapr: Move page pinning from arch code to
VFIO IOMMU driver

Will fix in the next try.


-- 
Alexey


* Re: [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
  2015-02-04 13:32       ` Alexander Graf
@ 2015-02-05  4:58         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-02-05  4:58 UTC (permalink / raw)
  To: Alexander Graf, Alex Williamson
  Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Gavin Shan, Alexander Gordeev, linux-kernel

On 02/05/2015 12:32 AM, Alexander Graf wrote:
> 
> 
> On 03.02.15 01:12, Alex Williamson wrote:
>> On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
>>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>>> per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
>>> for TCE tables. Right now just one table is supported.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>  arch/powerpc/include/asm/iommu.h            |  18 ++--
>>>  arch/powerpc/kernel/eeh.c                   |   2 +-
>>>  arch/powerpc/kernel/iommu.c                 |  34 ++++----
>>>  arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
>>>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
>>>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>>>  arch/powerpc/platforms/powernv/pci.h        |   4 +-
>>>  arch/powerpc/platforms/pseries/iommu.c      |   9 +-
>>>  drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
>>>  9 files changed, 170 insertions(+), 83 deletions(-)
>> [snip]
>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> index 29d5708..28909e1 100644
>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> @@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
>>>   */
>>>  struct tce_container {
>>>  	struct mutex lock;
>>> -	struct iommu_table *tbl;
>>> +	struct iommu_group *grp;
>>>  	bool enabled;
>>>  };
>>>  
>>> @@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
>>>  	return false;
>>>  }
>>>  
>>> +static struct iommu_table *spapr_tce_find_table(
>>> +		struct tce_container *container,
>>> +		phys_addr_t ioba)
>>> +{
>>> +	long i;
>>> +	struct iommu_table *ret = NULL;
>>> +	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
>>> +
>>> +	mutex_lock(&container->lock);
>>> +	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
>>> +		struct iommu_table *tbl = &iommu->tables[i];
>>> +		unsigned long entry = ioba >> tbl->it_page_shift;
>>> +		unsigned long start = tbl->it_offset;
>>> +		unsigned long end = start + tbl->it_size;
>>> +
>>> +		if ((start <= entry) && (entry < end)) {
>>> +			ret = tbl;
>>> +			break;
>>> +		}
>>> +	}
>>> +	mutex_unlock(&container->lock);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  static int tce_iommu_enable(struct tce_container *container)
>>>  {
>>>  	int ret = 0;
>>> +	struct powerpc_iommu *iommu;
>>> +	struct iommu_table *tbl;
>>>  
>>> -	if (!container->tbl)
>>> +	if (!container->grp)
>>>  		return -ENXIO;
>>>  
>>> -	if (!current->mm)
>>> -		return -ESRCH; /* process exited */
>>> -
>>>  	if (container->enabled)
>>>  		return -EBUSY;
>>>  
>>> @@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
>>>  	 * as this information is only available from KVM and VFIO is
>>>  	 * KVM agnostic.
>>>  	 */
>>> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
>>> +	iommu = iommu_group_get_iommudata(container->grp);
>>> +	if (!iommu)
>>> +		return -EFAULT;
>>> +
>>> +	tbl = &iommu->tables[0];
>>
>>
>> There should probably be a comment somewhere documenting that tables[0]
>> is the small window and presumably [1] will be the DDW.
> 
> Rather than a comment, how about an enum?


[0] could be DDW if the guest decides to remove the default window and
create one huge window in its place - older guests (SLES11 SP3) did that
(but they could not cope with a huge window starting from zero); newer
guests do not try removing the default window, but they might want to do
this later.

So I am not so sure what kind of comment would be good here...



-- 
Alexey


* Re: [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu
@ 2015-02-05  4:58         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 70+ messages in thread
From: Alexey Kardashevskiy @ 2015-02-05  4:58 UTC (permalink / raw)
  To: Alexander Graf, Alex Williamson
  Cc: Gavin Shan, linux-kernel, Alexander Gordeev, Paul Mackerras,
	linuxppc-dev

On 02/05/2015 12:32 AM, Alexander Graf wrote:
> 
> 
> On 03.02.15 01:12, Alex Williamson wrote:
>> On Thu, 2015-01-29 at 20:21 +1100, Alexey Kardashevskiy wrote:
>>> Modern IBM POWERPC systems support multiple (currently two) TCE tables
>>> per IOMMU group (a.k.a. PE). This adds a powerpc_iommu container
>>> for TCE tables. Right now just one table is supported.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>  arch/powerpc/include/asm/iommu.h            |  18 ++--
>>>  arch/powerpc/kernel/eeh.c                   |   2 +-
>>>  arch/powerpc/kernel/iommu.c                 |  34 ++++----
>>>  arch/powerpc/platforms/powernv/pci-ioda.c   |  37 +++++---
>>>  arch/powerpc/platforms/powernv/pci-p5ioc2.c |  16 ++--
>>>  arch/powerpc/platforms/powernv/pci.c        |   2 +-
>>>  arch/powerpc/platforms/powernv/pci.h        |   4 +-
>>>  arch/powerpc/platforms/pseries/iommu.c      |   9 +-
>>>  drivers/vfio/vfio_iommu_spapr_tce.c         | 131 ++++++++++++++++++++--------
>>>  9 files changed, 170 insertions(+), 83 deletions(-)
>> [snip]
>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> index 29d5708..28909e1 100644
>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> @@ -84,7 +84,7 @@ static void decrement_locked_vm(long npages)
>>>   */
>>>  struct tce_container {
>>>  	struct mutex lock;
>>> -	struct iommu_table *tbl;
>>> +	struct iommu_group *grp;
>>>  	bool enabled;
>>>  };
>>>  
>>> @@ -104,16 +104,40 @@ static bool tce_check_page_size(struct page *page, unsigned page_shift)
>>>  	return false;
>>>  }
>>>  
>>> +static struct iommu_table *spapr_tce_find_table(
>>> +		struct tce_container *container,
>>> +		phys_addr_t ioba)
>>> +{
>>> +	long i;
>>> +	struct iommu_table *ret = NULL;
>>> +	struct powerpc_iommu *iommu = iommu_group_get_iommudata(container->grp);
>>> +
>>> +	mutex_lock(&container->lock);
>>> +	for (i = 0; i < POWERPC_IOMMU_MAX_TABLES; ++i) {
>>> +		struct iommu_table *tbl = &iommu->tables[i];
>>> +		unsigned long entry = ioba >> tbl->it_page_shift;
>>> +		unsigned long start = tbl->it_offset;
>>> +		unsigned long end = start + tbl->it_size;
>>> +
>>> +		if ((start <= entry) && (entry < end)) {
>>> +			ret = tbl;
>>> +			break;
>>> +		}
>>> +	}
>>> +	mutex_unlock(&container->lock);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  static int tce_iommu_enable(struct tce_container *container)
>>>  {
>>>  	int ret = 0;
>>> +	struct powerpc_iommu *iommu;
>>> +	struct iommu_table *tbl;
>>>  
>>> -	if (!container->tbl)
>>> +	if (!container->grp)
>>>  		return -ENXIO;
>>>  
>>> -	if (!current->mm)
>>> -		return -ESRCH; /* process exited */
>>> -
>>>  	if (container->enabled)
>>>  		return -EBUSY;
>>>  
>>> @@ -142,7 +166,12 @@ static int tce_iommu_enable(struct tce_container *container)
>>>  	 * as this information is only available from KVM and VFIO is
>>>  	 * KVM agnostic.
>>>  	 */
>>> -	ret = try_increment_locked_vm(IOMMU_TABLE_PAGES(container->tbl));
>>> +	iommu = iommu_group_get_iommudata(container->grp);
>>> +	if (!iommu)
>>> +		return -EFAULT;
>>> +
>>> +	tbl = &iommu->tables[0];
>>
>>
>> There should probably be a comment somewhere documenting that tables[0]
>> is the small window and presumably [1] will be the DDW.
> 
> Rather than a comment, how about an enum?


[0] could be the DDW if the guest decides to remove the default window and
create a huge one in its place - older guests (SLES11 SP3) did that (although
they could not cope with a huge window starting at zero); newer guests do not
try removing the default window, but they might want to do so later.

So I am not so sure what kind of comment would be good here...



-- 
Alexey


end of thread, other threads:[~2015-02-05  4:59 UTC | newest]

Thread overview: 70+ messages
-- links below jump to the message on this page --
2015-01-29  9:21 [PATCH v3 00/24] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
2015-01-29  9:21 ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 01/24] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 02/24] vfio: powerpc/iommu: Check that TCE page size is equal to it_page_size Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-02-02 21:45   ` Alex Williamson
2015-02-02 21:45     ` Alex Williamson
2015-01-29  9:21 ` [PATCH v3 03/24] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 04/24] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 05/24] vfio: powerpc/spapr: Move locked_vm accounting to helpers Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-02-03  0:12   ` Alex Williamson
2015-02-03  0:12     ` Alex Williamson
2015-01-29  9:21 ` [PATCH v3 06/24] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 07/24] powerpc/iommu: Introduce iommu_table_alloc() helper Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 08/24] powerpc/spapr: vfio: Switch from iommu_table to new powerpc_iommu Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-02-03  0:12   ` Alex Williamson
2015-02-03  0:12     ` Alex Williamson
2015-02-04 13:32     ` Alexander Graf
2015-02-04 13:32       ` Alexander Graf
2015-02-05  4:58       ` Alexey Kardashevskiy
2015-02-05  4:58         ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 09/24] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 10/24] powerpc/powernv/ioda2: Rework IOMMU ownership control Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 11/24] powerpc/powernv/ioda/ioda2: Rework tce_build()/tce_free() Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 12/24] powerpc/iommu/powernv: Release replaced TCE Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-02-04  6:08   ` Paul Mackerras
2015-02-04  6:08     ` Paul Mackerras
2015-02-05  4:57     ` Alexey Kardashevskiy
2015-02-05  4:57       ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 13/24] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 14/24] vfio: powerpc/spapr: Register memory Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-02-03  0:11   ` Alex Williamson
2015-02-03  0:11     ` Alex Williamson
2015-02-03  5:51     ` Alexey Kardashevskiy
2015-02-03  5:51       ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 15/24] poweppc/powernv/ioda2: Rework iommu_table creation Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 16/24] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_create_table Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 17/24] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:21 ` [PATCH v3 18/24] powerpc/iommu: Split iommu_free_table into 2 helpers Alexey Kardashevskiy
2015-01-29  9:21   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 19/24] powerpc/powernv: Implement multilevel TCE tables Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 20/24] powerpc/powernv: Change prototypes to receive iommu Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 21/24] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 22/24] powerpc/iommu: Get rid of ownership helpers Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 23/24] vfio/spapr: Enable multiple groups in a container Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-01-29  9:22 ` [PATCH v3 24/24] vfio: powerpc/spapr: Support Dynamic DMA windows Alexey Kardashevskiy
2015-01-29  9:22   ` Alexey Kardashevskiy
2015-02-03  2:53   ` Alex Williamson
2015-02-03  2:53     ` Alex Williamson
