linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW)
@ 2014-07-24  8:47 Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 01/18] powerpc/iommu: Fix comments with it_page_shift Alexey Kardashevskiy
                   ` (17 more replies)
  0 siblings, 18 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This prepares existing upstream kernel for DDW (Dynamic DMA windows) and
adds actual DDW support for VFIO.

This patchset does not contain any in-kernel acceleration stuff.

This patchset does not enable DDW for emulated devices.

Changes:
v3:
* applied multiple comments from Gavin regarding error checking
and callbacks placements

v2:
* moved "Account TCE pages in locked_vm" here (was in later series)
* added counting for huge window to locked_vm (ugly but better than nothing)
* fixed bug with missing >>PAGE_SHIFT when calling pfn_to_page



Alexey Kardashevskiy (18):
  powerpc/iommu: Fix comments with it_page_shift
  KVM: PPC: Use RCU when adding to arch.spapr_tce_tables
  KVM: PPC: Account TCE pages in locked_vm
  vfio: powerpc: Move locked_vm accounting to a helper
  powerpc/powernv: Use it_page_shift for TCE invalidation
  powerpc/powernv: Use it_page_shift in TCE build
  powerpc/powernv: Add a page size parameter to
    pnv_pci_setup_iommu_table()
  powerpc/powernv: Make invalidate() a callback
  powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  powerpc/powernv: Convert/move set_bypass() callback to
    take_ownership()
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/iommu: Fix missing permission bits in
    iommu_put_tce_user_mode()
  powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values
  powerpc/powernv: Return non-zero TCE from pnv_tce_build
  powerpc/iommu: Implement put_page() if TCE had non-zero value
  powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  vfio: Use it_page_size
  vfio: powerpc: Enable Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h            |   7 +-
 arch/powerpc/include/asm/machdep.h          |   2 +
 arch/powerpc/include/asm/tce.h              |  38 +++
 arch/powerpc/kernel/iommu.c                 |  98 +++++---
 arch/powerpc/kvm/book3s_64_vio.c            |  37 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 255 ++++++++++++++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   4 +-
 arch/powerpc/platforms/powernv/pci.c        |  64 +++--
 arch/powerpc/platforms/powernv/pci.h        |  17 +-
 arch/powerpc/platforms/pseries/iommu.c      |  17 +-
 arch/powerpc/sysdev/dart_iommu.c            |   1 +
 drivers/vfio/vfio_iommu_spapr_tce.c         | 365 ++++++++++++++++++++++++----
 include/uapi/linux/vfio.h                   |  37 ++-
 13 files changed, 797 insertions(+), 145 deletions(-)

-- 
2.0.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 01/18] powerpc/iommu: Fix comments with it_page_shift
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables Alexey Kardashevskiy
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

There is a couple of commented debug prints which still use
IOMMU_PAGE_SHIFT() which is not defined for POWERPC anymore, replace
them with it_page_shift.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 88e3ec6..f84f799 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1037,7 +1037,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
-			__func__, hwaddr, entry << IOMMU_PAGE_SHIFT(tbl),
+			__func__, hwaddr, entry << tbl->it_page_shift,
 				hwaddr, ret); */
 
 	return ret;
@@ -1056,7 +1056,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 			direction != DMA_TO_DEVICE, &page);
 	if (unlikely(ret != 1)) {
 		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << IOMMU_PAGE_SHIFT(tbl), ret); */
+				tce, entry << tbl->it_page_shift, ret); */
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 01/18] powerpc/iommu: Fix comments with it_page_shift Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-28  0:40   ` Benjamin Herrenschmidt
  2014-07-24  8:47 ` [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm Alexey Kardashevskiy
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_64_vio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 54cf9bc..516f2ee 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
-	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 01/18] powerpc/iommu: Fix comments with it_page_shift Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-28  0:43   ` Benjamin Herrenschmidt
  2014-07-24  8:47 ` [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper Alexey Kardashevskiy
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kvm/book3s_64_vio.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 516f2ee..48b7ed4 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
 		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
+/*
+ * Checks ulimit in order not to let the user space to pin all
+ * available memory for TCE tables.
+ */
+static long kvmppc_account_memlimit(long npages)
+{
+	unsigned long ret = 0, locked, lock_limit;
+
+	if (!current->mm)
+		return -ESRCH; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+				rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+
 static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 {
 	struct kvm *kvm = stt->kvm;
 	int i;
+	long npages = kvmppc_stt_npages(stt->window_size);
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+	for (i = 0; i < npages; i++)
 		__free_page(stt->pages[i]);
+
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
+	kvmppc_account_memlimit(-(npages + 1));
+
 	kvm_put_kvm(kvm);
 }
 
@@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	}
 
 	npages = kvmppc_stt_npages(args->window_size);
+	ret = kvmppc_account_memlimit(npages + 1);
+	if (ret)
+		goto fail;
 
 	stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
 		      GFP_KERNEL);
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-07-24  8:47 ` [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-28  0:46   ` Benjamin Herrenschmidt
  2014-07-24  8:47 ` [PATCH v3 05/18] powerpc/powernv: Use it_page_shift for TCE invalidation Alexey Kardashevskiy
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Additional DMA windows support is coming and accounting will include them
so let's move this code to a helper for reuse.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 54 ++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a84788b..c8f7284 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,10 +47,40 @@ struct tce_container {
 	bool enabled;
 };
 
+/*
+ * Checks ulimit in order not to let the user space to pin all
+ * available memory for TCE tables.
+ */
+static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
+{
+	unsigned long ret = 0, locked, lock_limit;
+	long npages;
+
+	if (!current->mm)
+		return -ESRCH; /* process exited */
+
+	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	if (!inc)
+		npages = -npages;
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+				rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked, lock_limit, npages;
 	struct iommu_table *tbl = container->tbl;
 
 	if (!container->tbl)
@@ -80,20 +110,11 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
 	 */
-	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	locked = current->mm->locked_vm + npages;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-				rlimit(RLIMIT_MEMLOCK));
-		ret = -ENOMEM;
-	} else {
+	ret = tce_iommu_account_memlimit(tbl, true);
+	if (ret)
+		return ret;
 
-		current->mm->locked_vm += npages;
-		container->enabled = true;
-	}
-	up_write(&current->mm->mmap_sem);
+	container->enabled = true;
 
 	return ret;
 }
@@ -108,10 +129,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!container->tbl || !current->mm)
 		return;
 
-	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	up_write(&current->mm->mmap_sem);
+	tce_iommu_account_memlimit(container->tbl, false);
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 05/18] powerpc/powernv: Use it_page_shift for TCE invalidation
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2014-07-24  8:47 ` [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-24  8:47 ` [PATCH v3 06/18] powerpc/powernv: Use it_page_shift in TCE build Alexey Kardashevskiy
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This fixes IODA1/2 to use it_page_shift as it may be bigger than 4K.

This changes the involved constant values to use "ull" modifier.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index de19ede..40f968e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -513,15 +513,16 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
 		(__be64 __iomem *)tbl->it_index;
 	unsigned long start, end, inc;
+	const unsigned shift = tbl->it_page_shift;
 
 	start = __pa(startp);
 	end = __pa(endp);
 
 	/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
 	if (tbl->it_busno) {
-		start <<= 12;
-		end <<= 12;
-		inc = 128 << 12;
+		start <<= shift;
+		end <<= shift;
+		inc = 128ull << shift;
 		start |= tbl->it_busno;
 		end |= tbl->it_busno;
 	} else if (tbl->it_type & TCE_PCI_SWINV_PAIR) {
@@ -559,18 +560,19 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	__be64 __iomem *invalidate = rm ?
 		(__be64 __iomem *)pe->tce_inval_reg_phys :
 		(__be64 __iomem *)tbl->it_index;
+	const unsigned shift = tbl->it_page_shift;
 
 	/* We'll invalidate DMA address in PE scope */
-	start = 0x2ul << 60;
+	start = 0x2ull << 60;
 	start |= (pe->pe_number & 0xFF);
 	end = start;
 
 	/* Figure out the start, end and step */
 	inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
-	start |= (inc << 12);
+	start |= (inc << shift);
 	inc = tbl->it_offset + (((u64)endp - tbl->it_base) / sizeof(u64));
-	end |= (inc << 12);
-	inc = (0x1ul << 12);
+	end |= (inc << shift);
+	inc = (0x1ull << shift);
 	mb();
 
 	while (start <= end) {
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 06/18] powerpc/powernv: Use it_page_shift in TCE build
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2014-07-24  8:47 ` [PATCH v3 05/18] powerpc/powernv: Use it_page_shift for TCE invalidation Alexey Kardashevskiy
@ 2014-07-24  8:47 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 07/18] powerpc/powernv: Add a page size parameter to pnv_pci_setup_iommu_table() Alexey Kardashevskiy
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:47 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This makes use of iommu_table::it_page_shift instead of TCE_SHIFT and
TCE_RPN_SHIFT hardcoded values.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index f91a4e5..b6cb996 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -564,10 +564,11 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		proto_tce |= TCE_PCI_WRITE;
 
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
-	rpn = __pa(uaddr) >> TCE_SHIFT;
+	rpn = __pa(uaddr) >> tbl->it_page_shift;
 
 	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce | (rpn++ << TCE_RPN_SHIFT));
+		*(tcep++) = cpu_to_be64(proto_tce |
+				(rpn++ << tbl->it_page_shift));
 
 	/* Some implementations won't cache invalid TCEs and thus may not
 	 * need that flush. We'll probably turn it_type into a bit mask
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/18] powerpc/powernv: Add a page size parameter to pnv_pci_setup_iommu_table()
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (5 preceding siblings ...)
  2014-07-24  8:47 ` [PATCH v3 06/18] powerpc/powernv: Use it_page_shift in TCE build Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 08/18] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Since a TCE page size can be other than 4K, make it configurable for
P5IOC2 and IODA PHBs.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c   | 5 +++--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++-
 arch/powerpc/platforms/powernv/pci.c        | 6 +++---
 arch/powerpc/platforms/powernv/pci.h        | 2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 40f968e..9f28e18 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -656,7 +656,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	/* Setup linux iommu table */
 	tbl = &pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
-				  base << 28);
+				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
 	/* OPAL variant of P7IOC SW invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -786,7 +786,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 
 	/* Setup linux iommu table */
 	tbl = &pe->tce32_table;
-	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0);
+	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+			IOMMU_PAGE_SHIFT_4K);
 
 	/* OPAL variant of PHB3 invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index e3807d6..94ce348 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -172,7 +172,8 @@ static void __init pnv_pci_init_p5ioc2_phb(struct device_node *np, u64 hub_id,
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_p5ioc2_dma_dev_setup;
 	pnv_pci_setup_iommu_table(&phb->p5ioc2.iommu_table,
-				  tce_mem, tce_size, 0);
+				  tce_mem, tce_size, 0,
+				  IOMMU_PAGE_SHIFT_4K);
 }
 
 void __init pnv_pci_init_p5ioc2_hub(struct device_node *np)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b6cb996..4dff552 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,11 +628,11 @@ static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
 
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
-			       u64 dma_offset)
+			       u64 dma_offset, unsigned page_shift)
 {
 	tbl->it_blocksize = 16;
 	tbl->it_base = (unsigned long)tce_mem;
-	tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
+	tbl->it_page_shift = page_shift;
 	tbl->it_offset = dma_offset >> tbl->it_page_shift;
 	tbl->it_index = 0;
 	tbl->it_size = tce_size >> 3;
@@ -657,7 +657,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	if (WARN_ON(!tbl))
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
-				  be32_to_cpup(sizep), 0);
+				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	iommu_init_table(tbl, hose->node);
 	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 676232c..6f5ff69 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -198,7 +198,7 @@ int pnv_pci_cfg_write(struct device_node *dn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-				      u64 dma_offset);
+				      u64 dma_offset, unsigned page_shift);
 extern void pnv_pci_init_p5ioc2_hub(struct device_node *np);
 extern void pnv_pci_init_ioda_hub(struct device_node *np);
 extern void pnv_pci_init_ioda2_phb(struct device_node *np);
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/18] powerpc/powernv: Make invalidate() a callback
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (6 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 07/18] powerpc/powernv: Add a page size parameter to pnv_pci_setup_iommu_table() Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 09/18] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback.

This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer
to PE there. PNV's ppc_md.tce_build() call uses this to find PE and
do the invalidation. This will be used later for Dynamic DMA windows too.

This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++--------------------
 arch/powerpc/platforms/powernv/pci.c      | 31 +++++++++++++++++++++--------
 arch/powerpc/platforms/powernv/pci.h      | 13 +++++++++++-
 3 files changed, 47 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9f28e18..007497f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -462,7 +462,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -489,7 +489,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
 	}
 	return 0;
 }
@@ -499,7 +499,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		set_iommu_table_base_and_group(&dev->dev, &pe->tce32_table);
+		set_iommu_table_base_and_group(&dev->dev, &pe->tce32.table);
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
@@ -584,19 +584,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
-{
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -654,9 +641,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce_invalidate = pnv_pci_ioda1_tce_invalidate;
 
 	/* OPAL variant of P7IOC SW invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -693,7 +682,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
 	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -734,10 +723,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -785,9 +774,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce_invalidate = pnv_pci_ioda2_tce_invalidate;
 
 	/* OPAL variant of PHB3 invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4dff552..74a2626 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -550,6 +550,27 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };
 
+static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
+	__be64 *endp, bool rm)
+{
+	struct pnv_iommu_table *ptbl = container_of(tbl,
+			struct pnv_iommu_table, table);
+	struct pnv_ioda_pe *pe = ptbl->pe;
+
+	/*
+	 * Some implementations won't cache invalid TCEs and thus may not
+	 * need that flush. We'll probably turn it_type into a bit mask
+	 * of flags if that becomes the case
+	 */
+	if (!(tbl->it_type & TCE_PCI_SWINV_FREE))
+		return;
+
+	if (!pe || !pe->tce_invalidate)
+		return;
+
+	pe->tce_invalidate(pe, tbl, startp, endp, rm);
+}
+
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
@@ -570,12 +591,7 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		*(tcep++) = cpu_to_be64(proto_tce |
 				(rpn++ << tbl->it_page_shift));
 
-	/* Some implementations won't cache invalid TCEs and thus may not
-	 * need that flush. We'll probably turn it_type into a bit mask
-	 * of flags if that becomes the case
-	 */
-	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 
 	return 0;
 }
@@ -599,8 +615,7 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
 	while (npages--)
 		*(tcep++) = cpu_to_be64(0);
 
-	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 }
 
 static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6f5ff69..32847a5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -22,6 +22,16 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS		(1 << 1)	/* PE has primary PCI bus	*/
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 
+struct pnv_ioda_pe;
+typedef void (*pnv_invalidate_fn)(struct pnv_ioda_pe *pe,
+		struct iommu_table *tbl,
+		__be64 *startp, __be64 *endp, bool rm);
+
+struct pnv_iommu_table {
+	struct iommu_table	table;
+	struct pnv_ioda_pe	*pe;
+};
+
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
 struct pnv_ioda_pe {
@@ -49,9 +59,10 @@ struct pnv_ioda_pe {
 	unsigned int		dma_weight;
 
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
+	pnv_invalidate_fn	tce_invalidate;
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/18] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (7 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 08/18] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Modern IBM POWERPC systems support multiple IOMMU tables per PE
so we need a more reliable way (compared to container_of()) to get
a PE pointer from the iommu_table struct pointer used in IOMMU functions.

At the moment IOMMU group data points to an iommu_table struct. This
introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
the pnv_ioda_pe struct, for others it is still a pointer to
the iommu_table struct. The ops structs correspond to the type which
iommu_owner points to.

At the moment a get_table() callback is the only one. It returns
an iommu_table for a bus address.

As the IOMMU group data pointer points to variable type instead of
iommu_table, VFIO SPAPR TCE driver is fixed to use new type.
This changes the tce_container struct to keep iommu_group instead of
iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via
iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
	- iommu_table;
	- pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1&2 but IODA2 will
have own pnv_ioda2_iommu_get_table soon and pnv_ioda1_iommu_get_table
will only be used for IODA1.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   6 ++
 arch/powerpc/include/asm/tce.h              |  15 ++++
 arch/powerpc/kernel/iommu.c                 |  34 ++++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  39 +++++++++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/pseries/iommu.c      |  10 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 112 +++++++++++++++++++++-------
 8 files changed, 184 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+				 void *iommu_owner,
+				 struct spapr_tce_iommu_ops *ops,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+					void *iommu_owner,
+					struct spapr_tce_iommu_ops *ops,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..8bfe98f 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,20 @@
 #define TCE_PCI_READ		0x1		/* read from PCI allowed */
 #define TCE_VB_WRITE		0x1		/* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+#define TCE_DEFAULT_WINDOW	~(0ULL)
+
+struct spapr_tce_iommu_ops {
+	struct iommu_table *(*get_table)(
+			struct spapr_tce_iommu_group *data,
+			phys_addr_t addr);
+};
+
+struct spapr_tce_iommu_group {
+	void *iommu_owner;
+	struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index f84f799..e203314 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -877,24 +877,52 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_get_default_table(
+		struct spapr_tce_iommu_group *data, phys_addr_t addr)
+{
+	struct iommu_table *tbl = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return tbl;
+
+	if ((addr >> tbl->it_page_shift) < tbl->it_size)
+		return tbl;
+
+	return NULL;
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+	.get_table = spapr_tce_get_default_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
 		int pci_domain_number, unsigned long pe_num)
 {
 	struct iommu_group *grp;
 	char *name;
+	struct spapr_tce_iommu_group *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return;
+
+	data->iommu_owner = iommu_owner ? iommu_owner : tbl;
+	data->ops = ops ? ops : &spapr_tce_default_ops;
 
 	grp = iommu_group_alloc();
 	if (IS_ERR(grp)) {
 		pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
 				PTR_ERR(grp));
+		kfree(data);
 		return;
 	}
+
 	tbl->it_group = grp;
-	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu_group_set_iommudata(grp, data, group_release);
 	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			pci_domain_number, pe_num);
 	if (!name)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 007497f..495137b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/msi.h>
 #include <linux/memblock.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -584,6 +585,34 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
+static bool pnv_pci_ioda_check_addr(struct iommu_table *tbl, __u64 start_addr)
+{
+	unsigned long entry = start_addr >> tbl->it_page_shift;
+	unsigned long start = tbl->it_offset;
+	unsigned long end = start + tbl->it_size;
+
+	return (start <= entry) && (entry < end);
+}
+
+static struct iommu_table *pnv_ioda1_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		phys_addr_t addr)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return &pe->tce32.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce32.table, addr))
+		return &pe->tce32.table;
+
+	return NULL;
+}
+
+static struct spapr_tce_iommu_ops pnv_pci_ioda1_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -663,7 +692,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -729,6 +759,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
+static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -794,7 +828,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 94ce348..b79066d 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -89,6 +89,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
 		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
+				NULL, NULL,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 74a2626..cc54e3b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -674,7 +674,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	iommu_init_table(tbl, hose->node);
-	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 33b552f..a047754 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -616,7 +616,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 	iommu_table_setparms(pci->phb, dn, tbl);
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(tbl, pci_domain_nr(bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -661,7 +661,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(bus), 0);
+		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -688,7 +688,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
-		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
 					       PCI_DN(dn)->iommu_table);
 		return;
@@ -1104,7 +1105,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index c8f7284..917c854 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -43,7 +43,7 @@ static void tce_iommu_detach_group(void *iommu_data,
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_table *tbl;
+	struct iommu_group *grp;
 	bool enabled;
 };
 
@@ -81,9 +81,14 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	struct iommu_table *tbl = container->tbl;
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	if (!container->tbl)
+	if (!container->grp)
+		return -ENXIO;
+
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return -ENXIO;
 
 	if (!current->mm)
@@ -110,6 +115,10 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
 	 */
+	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+	if (!tbl)
+		return -ENXIO;
+
 	ret = tce_iommu_account_memlimit(tbl, true);
 	if (ret)
 		return ret;
@@ -121,15 +130,26 @@ static int tce_iommu_enable(struct tce_container *container)
 
 static void tce_iommu_disable(struct tce_container *container)
 {
+	struct spapr_tce_iommu_group *data;
+	struct iommu_table *tbl;
+
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->tbl || !current->mm)
+	if (!container->grp || !current->mm)
 		return;
 
-	tce_iommu_account_memlimit(container->tbl, false);
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
+		return;
+
+	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+	if (!tbl)
+		return;
+
+	tce_iommu_account_memlimit(tbl, false);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -154,11 +174,11 @@ static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
 
-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (container->grp)
+		tce_iommu_detach_group(iommu_data, container->grp);
 
 	mutex_destroy(&container->lock);
 
@@ -178,8 +198,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
 		if (WARN_ON(!tbl))
 			return -ENXIO;
 
@@ -203,13 +232,16 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 		unsigned long tce, i;
 
-		if (!tbl)
+		if (WARN_ON(!container->grp))
 			return -ENXIO;
 
-		BUG_ON(!tbl->it_group);
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -234,6 +266,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
 			tce |= TCE_PCI_WRITE;
 
+		tbl = data->ops->get_table(data, param.iova);
+		if (!tbl)
+			return -ENXIO;
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_put_param_check(tbl, param.iova, tce);
 		if (ret)
 			return ret;
@@ -256,9 +293,14 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_UNMAP_DMA: {
 		struct vfio_iommu_type1_dma_unmap param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
@@ -277,6 +319,12 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.size & ~IOMMU_PAGE_MASK_4K)
 			return -EINVAL;
 
+		tbl = data->ops->get_table(data, param.iova);
+		if (WARN_ON(!tbl))
+			return -ENXIO;
+
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
 				param.size >> IOMMU_PAGE_SHIFT_4K);
 		if (ret)
@@ -311,16 +359,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 {
 	int ret;
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->tbl) {
+	if (container->grp) {
 		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(container->grp),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else if (container->enabled) {
@@ -328,9 +376,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else {
+		data = iommu_group_get_iommudata(iommu_group);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+		BUG_ON(!tbl);
+
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
-			container->tbl = tbl;
+			container->grp = iommu_group;
 	}
 
 	mutex_unlock(&container->lock);
@@ -342,24 +397,31 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
-	if (tbl != container->tbl) {
+	if (iommu_group != container->grp) {
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
-				iommu_group_id(tbl->it_group));
+				iommu_group_id(container->grp));
 	} else {
 		if (container->enabled) {
 			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
+					iommu_group_id(container->grp));
 			tce_iommu_disable(container);
 		}
 
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
+		container->grp = NULL;
+
+		data = iommu_group_get_iommudata(iommu_group);
+		BUG_ON(!data || !data->iommu_owner || !data->ops);
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+		BUG_ON(!tbl);
+
 		iommu_release_ownership(tbl);
 	}
 	mutex_unlock(&container->lock);
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 09/18] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-28  1:18   ` Benjamin Herrenschmidt
  2014-07-24  8:48 ` [PATCH v3 11/18] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

Since the set_bypass() is not really an iommu_table function but PE's
function, and we have an ops struct per IOMMU owner, let's move
set_bypass() to the spapr_tce_iommu_ops struct.

As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
has very little to do with PEs, this moves take_ownership() calls to
the VFIO SPAPR TCE driver.

This renames set_bypass() to take_ownership() as it is not necessarily
just enabling bypassing, it can be something else/more so let's give it
a generic name. The bool parameter is inverted.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |  1 -
 arch/powerpc/include/asm/tce.h            |  2 ++
 arch/powerpc/kernel/iommu.c               | 12 ------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
 5 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 84ee339..2b0b01d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,7 +77,6 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
-	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 8bfe98f..5ee4987 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
 	struct iommu_table *(*get_table)(
 			struct spapr_tce_iommu_group *data,
 			phys_addr_t addr);
+	void (*take_ownership)(struct spapr_tce_iommu_group *data,
+			bool enable);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e203314..06984d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
-	/*
-	 * Disable iommu bypass, otherwise the user can DMA to all of
-	 * our physical memory via the bypass window instead of just
-	 * the pages that has been explicitly mapped into the iommu
-	 */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, false);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
-
-	/* The kernel owns the device now, we can restore the iommu bypass */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 495137b..f828c57 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	/* Install set_bypass callback for VFIO */
-	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
-
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
+	pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
+				     bool enable)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
 	.get_table = pnv_ioda1_iommu_get_table,
+	.take_ownership = pnv_ioda2_take_ownership,
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 917c854..05b2f11 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -78,6 +78,13 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
 	return ret;
 }
 
+static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
+		bool enable)
+{
+	if (data && data->ops && data->ops->take_ownership)
+		data->ops->take_ownership(data, enable);
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -386,6 +393,12 @@ static int tce_iommu_attach_group(void *iommu_data,
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
 			container->grp = iommu_group;
+		/*
+		 * Disable iommu bypass, otherwise the user can DMA to all of
+		 * our physical memory via the bypass window instead of just
+		 * the pages that has been explicitly mapped into the iommu
+		 */
+		tce_iommu_take_ownership_notify(data, true);
 	}
 
 	mutex_unlock(&container->lock);
@@ -423,6 +436,9 @@ static void tce_iommu_detach_group(void *iommu_data,
 		BUG_ON(!tbl);
 
 		iommu_release_ownership(tbl);
+
+		/* Kernel owns the device now, we can restore bypass */
+		tce_iommu_take_ownership_notify(data, false);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 11/18] powerpc/iommu: Fix IOMMU ownership control functions
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode() Alexey Kardashevskiy
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears TCE content if there is no page marked busy in it_map.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 06984d5..0cda2e8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1103,33 +1103,56 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0, bit0 = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+		bit0 = test_and_clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (bit0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	if (!ret)
+		iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+				tbl->it_size);
 
-	return 0;
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
+
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 11/18] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-28  1:19   ` Benjamin Herrenschmidt
  2014-07-24  8:48 ` [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This adds missing permission bits to the translated TCE.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0cda2e8..5af2319 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
+	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
 
 	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
 	if (ret)
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode() Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-28  1:09   ` Benjamin Herrenschmidt
  2014-07-28  1:16   ` Benjamin Herrenschmidt
  2014-07-24  8:48 ` [PATCH v3 14/18] powerpc/powernv: Return non-zero TCE from pnv_tce_build Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

The tce_build/tce_build_rm callbacks are used to implement H_PUT_TCE/etc
hypercalls. The PAPR spec does not allow to fail if the TCE is not empty.
However we cannot just overwrite the existing TCE value with the new one
as we still have to do page counting.

This adds an optional @old_tces return parameter. If it is not NULL,
it must point to an array of @npages size where the callbacks will
store old TCE values. Since tce_build receives virtual addresses,
the old_tces array will contain virtual addresses as well.

As this patch is mechanical, no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h     |  2 ++
 arch/powerpc/kernel/iommu.c            |  8 +++++---
 arch/powerpc/platforms/powernv/pci.c   | 13 ++++++++-----
 arch/powerpc/platforms/pseries/iommu.c |  7 +++++--
 arch/powerpc/sysdev/dart_iommu.c       |  1 +
 5 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index f92b0b5..f11596c 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -69,6 +69,7 @@ struct machdep_calls {
 				     long index,
 				     long npages,
 				     unsigned long uaddr,
+				     unsigned long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs);
 	void		(*tce_free)(struct iommu_table *tbl,
@@ -83,6 +84,7 @@ struct machdep_calls {
 				     long index,
 				     long npages,
 				     unsigned long uaddr,
+				     long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs);
 	void		(*tce_free_rm)(struct iommu_table *tbl,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5af2319..ccf7510 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -324,7 +324,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	/* Put the TCEs in the HW table */
 	build_fail = ppc_md.tce_build(tbl, entry, npages,
 				      (unsigned long)page &
-				      IOMMU_PAGE_MASK(tbl), direction, attrs);
+				      IOMMU_PAGE_MASK(tbl), NULL, direction,
+				      attrs);
 
 	/* ppc_md.tce_build() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
@@ -497,7 +498,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		/* Insert into HW table */
 		build_fail = ppc_md.tce_build(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
-					      direction, attrs);
+					      NULL, direction, attrs);
 		if(unlikely(build_fail))
 			goto failure;
 
@@ -1059,7 +1060,8 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 	oldtce = ppc_md.tce_get(tbl, entry);
 	/* Add new entry if it is not busy */
 	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
+				direction, NULL);
 
 	spin_unlock(&(pool->lock));
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index cc54e3b..4b764c2 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,7 +572,8 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
 }
 
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
-			 unsigned long uaddr, enum dma_data_direction direction,
+			 unsigned long uaddr, unsigned long *old_tces,
+			 enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
 {
 	u64 proto_tce;
@@ -597,12 +598,12 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 }
 
 static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
+			    unsigned long uaddr, unsigned long *old_tces,
 			    enum dma_data_direction direction,
 			    struct dma_attrs *attrs)
 {
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
-			false);
+	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
+			attrs, false);
 }
 
 static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
@@ -630,10 +631,12 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 
 static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
 			    unsigned long uaddr,
+			    long *old_tces,
 			    enum dma_data_direction direction,
 			    struct dma_attrs *attrs)
 {
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
+	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
+			attrs, true);
 }
 
 static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a047754..6c70b7c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -82,6 +82,7 @@ static void tce_invalidate_pSeries_sw(struct iommu_table *tbl,
 
 static int tce_build_pSeries(struct iommu_table *tbl, long index,
 			      long npages, unsigned long uaddr,
+			      unsigned long *old_tces,
 			      enum dma_data_direction direction,
 			      struct dma_attrs *attrs)
 {
@@ -138,6 +139,7 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
+				unsigned long *old_tces,
 				enum dma_data_direction direction,
 				struct dma_attrs *attrs)
 {
@@ -181,6 +183,7 @@ static DEFINE_PER_CPU(__be64 *, tce_page);
 
 static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
+				     unsigned long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs)
 {
@@ -195,7 +198,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	if (npages == 1) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
-		                           direction, attrs);
+					   old_tces, direction, attrs);
 	}
 
 	local_irq_save(flags);	/* to protect tcep and the page behind it */
@@ -211,7 +214,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
-					    direction, attrs);
+					    old_tces, direction, attrs);
 		}
 		__get_cpu_var(tce_page) = tcep;
 	}
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 9e5353f..0d3cf7c 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -162,6 +162,7 @@ static void dart_flush(struct iommu_table *tbl)
 
 static int dart_build(struct iommu_table *tbl, long index,
 		       long npages, unsigned long uaddr,
+		       unsigned long *old_tces,
 		       enum dma_data_direction direction,
 		       struct dma_attrs *attrs)
 {
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 14/18] powerpc/powernv: Return non-zero TCE from pnv_tce_build
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 15/18] powerpc/iommu: Implement put_page() if TCE had non-zero value Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This returns old TCE values to the caller if requested.
The caller is expectded to call put_page() for them.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4b764c2..164d653 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -579,6 +579,7 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	u64 proto_tce;
 	__be64 *tcep, *tces;
 	u64 rpn;
+	long i;
 
 	proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -588,9 +589,13 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
 	rpn = __pa(uaddr) >> tbl->it_page_shift;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce |
-				(rpn++ << tbl->it_page_shift));
+	for (i = 0; i < npages; i++) {
+		unsigned long oldtce = xchg(tcep, cpu_to_be64(proto_tce |
+				(rpn++ << tbl->it_page_shift)));
+		if (old_tces)
+			old_tces[i] = (unsigned long) __va(oldtce);
+		tcep++;
+	}
 
 	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 15/18] powerpc/iommu: Implement put_page() if TCE had non-zero value
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 14/18] powerpc/powernv: Return non-zero TCE from pnv_tce_build Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 16/18] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

Guests might put new TCEs without clearing them first and the PAPR spec
allows that.

This adds put_page() for TCEs which we just replaced.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ccf7510..f8bf641 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1057,11 +1057,11 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
-				direction, NULL);
+	ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, &oldtce,
+			direction, NULL);
+
+	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
+		put_page(pfn_to_page(__pa(oldtce) >> PAGE_SHIFT));
 
 	spin_unlock(&(pool->lock));
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 16/18] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 15/18] powerpc/iommu: Implement put_page() if TCE had non-zero value Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 17/18] vfio: Use it_page_size Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 18/18] vfio: powerpc: Enable Dynamic DMA windows Alexey Kardashevskiy
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

SPAPR defines an interface to create additional DMA windows dynamically.
"Dynamically" means that the window is not allocated at the guest start
and the guest can request it later. In practice, existing linux guests
check for the capability and if it is there, they create+map one big DMA
window as big as the entire guest RAM.

SPAPR defines 4 RTAS calls for this feature which userspace implements.
This adds 4 callbacks into the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns number/size of windows
which can be created (one, any page size);
2. create - ibm,create-pe-dma-window - creates a window;
3. remove - ibm,remove-pe-dma-window - removes a window; only additional
window created by create() can be removed, the default 32bit window cannot
be removed as guests do not expect new windows to start from zero;
4. reset -  ibm,reset-pe-dma-window - reset the DMA windows configuration
to the default state; now it only removes the additional window if it
was created.

The next patch will add corresponding ioctls to VFIO SPAPR TCE driver to
pass RTAS call from the userspace to the IODA code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/tce.h            |  21 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 157 +++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.h      |   2 +
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 5ee4987..583463b 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -60,6 +60,27 @@ struct spapr_tce_iommu_ops {
 			phys_addr_t addr);
 	void (*take_ownership)(struct spapr_tce_iommu_group *data,
 			bool enable);
+
+	/* Dynamic DMA window */
+	/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+	long (*query)(struct spapr_tce_iommu_group *data,
+			__u32 *windows_available,
+			__u32 *page_size_mask);
+	long (*create)(struct spapr_tce_iommu_group *data,
+			__u32 page_shift,
+			__u32 window_shift,
+			struct iommu_table **ptbl);
+	long (*remove)(struct spapr_tce_iommu_group *data,
+			struct iommu_table *tbl);
+	long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f828c57..2f2bdab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -754,6 +754,24 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		phys_addr_t addr)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return &pe->tce32.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce64.table, addr))
+		return &pe->tce64.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce32.table, addr))
+		return &pe->tce32.table;
+
+	return NULL;
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 				     bool enable)
 {
@@ -762,9 +780,146 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+		__u32 *windows_available, __u32 *page_size_mask)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (pe->tce64_active) {
+		*page_size_mask = 0;
+		*windows_available = 0;
+	} else {
+		*page_size_mask =
+			DDW_PGSIZE_4K |
+			DDW_PGSIZE_64K |
+			DDW_PGSIZE_16M;
+		*windows_available = 1;
+	}
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table **ptbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	struct page *tce_mem = NULL;
+	void *addr;
+	long ret;
+	unsigned long tce_table_size =
+			(1ULL << (window_shift - page_shift)) * 8;
+	unsigned order;
+	struct iommu_table *tbl64 = &pe->tce64.table;
+
+	if ((page_shift != 12) && (page_shift != 16) && (page_shift != 24))
+		return -EINVAL;
+
+	if (window_shift > (memory_hotplug_max() >> page_shift))
+		return -EINVAL;
+
+	if (pe->tce64_active)
+		return -EBUSY;
+
+	tce_table_size = max(0x1000UL, tce_table_size);
+	order = get_order(tce_table_size);
+
+	pe_info(pe, "Setting up DDW at %llx..%llx ws=0x%x ps=0x%x table_size=0x%lx order=0x%x\n",
+			pe->tce_bypass_base,
+			pe->tce_bypass_base + (1ULL << window_shift) - 1,
+			window_shift, page_shift, tce_table_size, order);
+
+	tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, order);
+	if (!tce_mem) {
+		pe_err(pe, " Failed to allocate a DDW\n");
+		return -EFAULT;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, tce_table_size);
+
+	/* Configure HW */
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1, /* Window number */
+			1,
+			__pa(addr),
+			tce_table_size,
+			1 << page_shift);
+	if (ret) {
+		pe_err(pe, " Failed to configure 32-bit TCE table, err %ld\n",
+				ret);
+		return -EFAULT;
+	}
+
+	/* Setup linux iommu table */
+	pnv_pci_setup_iommu_table(tbl64, addr, tce_table_size,
+			pe->tce_bypass_base, page_shift);
+	pe->tce64.pe = pe;
+
+	/* Copy "invalidate" register address */
+	tbl64->it_index = pe->tce32.table.it_index;
+	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
+			TCE_PCI_SWINV_PAIR;
+	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
+
+	*ptbl = &pe->tce64.table;
+
+	pe->tce64_active = true;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_remove(struct spapr_tce_iommu_group *data,
+		struct iommu_table *tbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	long ret;
+
+	/* Only additional 64bit window removal is supported */
+	if ((tbl != &pe->tce64.table) || !pe->tce64_active)
+		return -EFAULT;
+
+	pe_info(pe, "Removing huge 64bit DMA window\n");
+
+	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	pe->tce64_active = false;
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret)
+		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+
+	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+	memset(&pe->tce64, 0, sizeof(pe->tce64));
+
+	return ret;
+}
+
+static long pnv_pci_ioda2_ddw_reset(struct spapr_tce_iommu_group *data)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pe_info(pe, "Reset DMA windows\n");
+
+	if (!pe->tce64_active)
+		return 0;
+
+	return pnv_pci_ioda2_ddw_remove(data, &pe->tce64.table);
+}
+
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
-	.get_table = pnv_ioda1_iommu_get_table,
+	.get_table = pnv_ioda2_iommu_get_table,
 	.take_ownership = pnv_ioda2_take_ownership,
+	.query = pnv_pci_ioda2_ddw_query,
+	.create = pnv_pci_ioda2_ddw_create,
+	.remove = pnv_pci_ioda2_ddw_remove,
+	.reset = pnv_pci_ioda2_ddw_reset
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 32847a5..7e88d8a 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -64,6 +64,8 @@ struct pnv_ioda_pe {
 	int			tce32_segcount;
 	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
+	bool			tce64_active;
+	struct pnv_iommu_table	tce64;
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 17/18] vfio: Use it_page_size
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (15 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 16/18] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  2014-07-24  8:48 ` [PATCH v3 18/18] vfio: powerpc: Enable Dynamic DMA windows Alexey Kardashevskiy
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This makes use of the it_page_size from the iommu_table struct
as page size can differ.

This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 05b2f11..2deab4a 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -59,7 +59,7 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
 	if (!current->mm)
 		return -ESRCH; /* process exited */
 
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 	if (!inc)
 		npages = -npages;
 
@@ -228,8 +228,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -282,17 +282,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
-		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
+		for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) {
 			ret = iommu_put_tce_user_mode(tbl,
-					(param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
+					(param.iova >> tbl->it_page_shift) + i,
 					tce);
 			if (ret)
 				break;
-			tce += IOMMU_PAGE_SIZE_4K;
+			tce += IOMMU_PAGE_SIZE(tbl);
 		}
 		if (ret)
 			iommu_clear_tces_and_put_pages(tbl,
-					param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+					param.iova >> tbl->it_page_shift, i);
 
 		iommu_flush_tce(tbl);
 
@@ -333,13 +333,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 		BUG_ON(!tbl->it_group);
 
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.size >> tbl->it_page_shift);
 		if (ret)
 			return ret;
 
 		ret = iommu_clear_tces_and_put_pages(tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				param.size >> tbl->it_page_shift);
 		iommu_flush_tce(tbl);
 
 		return ret;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 18/18] vfio: powerpc: Enable Dynamic DMA windows
  2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (16 preceding siblings ...)
  2014-07-24  8:48 ` [PATCH v3 17/18] vfio: Use it_page_size Alexey Kardashevskiy
@ 2014-07-24  8:48 ` Alexey Kardashevskiy
  17 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-24  8:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Paul Mackerras, Gavin Shan

This defines and implements VFIO IOMMU API required to support
Dynamic DMA windows defined in the SPAPR specification. The ioctl handlers
implement host-size part of corresponding RTAS calls:
- VFIO_IOMMU_SPAPR_TCE_QUERY - ibm,query-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_CREATE - ibm,create-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_REMOVE - ibm,remove-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_RESET - ibm,reset-pe-dma-window.

The VFIO IOMMU driver does basic sanity checks and calls corresponding
SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
implements them.

This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
VFIO_IOMMU_SPAPR_TCE_GET_INFO.

This calls reset() when IOMMU is being disabled (happens when VFIO stops
using it).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   1 +
 drivers/vfio/vfio_iommu_spapr_tce.c       | 169 +++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h                 |  37 ++++++-
 3 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2f2bdab..abd97be 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -859,6 +859,7 @@ static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
 
 	/* Copy "invalidate" register address */
 	tbl64->it_index = pe->tce32.table.it_index;
+	tbl64->it_group = pe->tce32.table.it_group;
 	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
 			TCE_PCI_SWINV_PAIR;
 	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2deab4a..20d34b8 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -45,6 +45,7 @@ struct tce_container {
 	struct mutex lock;
 	struct iommu_group *grp;
 	bool enabled;
+	unsigned long start64;
 };
 
 /*
@@ -145,18 +146,34 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!container->grp)
 		return;
 
 	data = iommu_group_get_iommudata(container->grp);
 	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return;
 
+	/* Try resetting, there might have been a 64bit window */
+	if (data->ops->reset)
+		data->ops->reset(data);
+
+	if (!current->mm)
+		return;
+
 	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
 	if (!tbl)
 		return;
 
 	tce_iommu_account_memlimit(tbl, false);
+
+	if (!container->start64)
+		return;
+
+	tbl = data->ops->get_table(data, container->start64);
+	if (!tbl)
+		return;
+
+	tce_iommu_account_memlimit(tbl, false);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -231,6 +248,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
 		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
+		if (data->ops->query && data->ops->create && data->ops->remove)
+			info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
@@ -356,6 +375,154 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
+
+	case VFIO_IOMMU_SPAPR_TCE_QUERY: {
+		struct vfio_iommu_spapr_tce_query query;
+		struct spapr_tce_iommu_group *data;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_query,
+				page_size_mask);
+
+		if (copy_from_user(&query, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (query.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->query || !data->iommu_owner)
+			return -ENOSYS;
+
+		ret = data->ops->query(data,
+				&query.windows_available,
+				&query.page_size_mask);
+
+		if (ret)
+			return ret;
+
+		if (copy_to_user((void __user *)arg, &query, minsz))
+			return -EFAULT;
+
+		return 0;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->create || !data->iommu_owner)
+			return -ENOSYS;
+
+		BUG_ON(!data || !data->ops || !data->ops->remove);
+
+		ret = data->ops->create(data, create.page_shift,
+				create.window_shift, &tbl);
+		if (ret)
+			return ret;
+
+		ret = tce_iommu_account_memlimit(tbl, true);
+		if (ret) {
+			data->ops->remove(data, tbl);
+			return ret;
+		}
+
+		create.start_addr = tbl->it_offset << tbl->it_page_shift;
+
+		if (copy_to_user((void __user *)arg, &create, minsz)) {
+			data->ops->remove(data, tbl);
+			tce_iommu_account_memlimit(tbl, false);
+			return -EFAULT;
+		}
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+		struct vfio_iommu_spapr_tce_remove remove;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+				start_addr);
+
+		if (copy_from_user(&remove, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (remove.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->remove || !data->iommu_owner)
+			return -ENOSYS;
+
+		tbl = data->ops->get_table(data, remove.start_addr);
+		if (!tbl)
+			return -EINVAL;
+
+		ret = data->ops->remove(data, tbl);
+		if (ret)
+			return ret;
+
+		tce_iommu_account_memlimit(tbl, false);
+
+		return 0;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_RESET: {
+		struct vfio_iommu_spapr_tce_reset reset;
+		struct spapr_tce_iommu_group *data;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_reset, argsz);
+
+		if (copy_from_user(&reset, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (reset.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->reset || !data->iommu_owner)
+			return -ENOSYS;
+
+		ret = data->ops->reset(data);
+		if (ret)
+			return ret;
+
+		if (container->start64) {
+			struct iommu_table *tbl;
+
+			tbl = data->ops->get_table(data, container->start64);
+			BUG_ON(!tbl);
+
+			tce_iommu_account_memlimit(tbl, false);
+		}
+
+		return 0;
+	}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..8b03381 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -448,13 +448,48 @@ struct vfio_iommu_type1_dma_unmap {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * Dynamic DMA windows
+ */
+struct vfio_iommu_spapr_tce_query {
+	__u32 argsz;
+	/* out */
+	__u32 windows_available;
+	__u32 page_size_mask;
+};
+#define VFIO_IOMMU_SPAPR_TCE_QUERY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	/* out */
+	__u64 start_addr;
+
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+struct vfio_iommu_spapr_tce_reset {
+	__u32 argsz;
+};
+#define VFIO_IOMMU_SPAPR_TCE_RESET	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables
  2014-07-24  8:47 ` [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables Alexey Kardashevskiy
@ 2014-07-28  0:40   ` Benjamin Herrenschmidt
  2014-07-28  4:11     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  0:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

You need a much better changeset here, explaining what you are
protecting against and why just changing that one list_add is
sufficient.

Cheers,
Ben.

> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 54cf9bc..516f2ee 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	kvm_get_kvm(kvm);
>  
>  	mutex_lock(&kvm->lock);
> -	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> +	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
>  
>  	mutex_unlock(&kvm->lock);
>  

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm
  2014-07-24  8:47 ` [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm Alexey Kardashevskiy
@ 2014-07-28  0:43   ` Benjamin Herrenschmidt
  2014-07-28  4:23     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  0:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---

You need a description.

>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 516f2ee..48b7ed4 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
>  		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> +/*
> + * Checks ulimit in order not to let the user space to pin all
> + * available memory for TCE tables.
> + */
> +static long kvmppc_account_memlimit(long npages)
> +{
> +	unsigned long ret = 0, locked, lock_limit;
> +
> +	if (!current->mm)
> +		return -ESRCH; /* process exited */
> +
> +	down_write(&current->mm->mmap_sem);
> +	locked = current->mm->locked_vm + npages;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> +				rlimit(RLIMIT_MEMLOCK));
> +		ret = -ENOMEM;
> +	} else {
> +		current->mm->locked_vm += npages;
> +	}
> +	up_write(&current->mm->mmap_sem);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  {
>  	struct kvm *kvm = stt->kvm;
>  	int i;
> +	long npages = kvmppc_stt_npages(stt->window_size);
>  
>  	mutex_lock(&kvm->lock);
>  	list_del(&stt->list);
> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> +	for (i = 0; i < npages; i++)
>  		__free_page(stt->pages[i]);
> +
>  	kfree(stt);
>  	mutex_unlock(&kvm->lock);
>  
> +	kvmppc_account_memlimit(-(npages + 1));
> +
>  	kvm_put_kvm(kvm);
>  }
>  
> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	}
>  
>  	npages = kvmppc_stt_npages(args->window_size);
> +	ret = kvmppc_account_memlimit(npages + 1);
> +	if (ret)
> +		goto fail;

This is called for VFIO only or is it also called when creating TCE
tables for emulated devices ? Because in the latter case, you don't
want to account the pages as locked, do you ?

Also, you need to explain what +1

Finally, do I correctly deduce that creating 10 TCE tables of 2G
each will end up accounting 20G as locked even if the guest for
example only has 4G of RAM ? 

>  	stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
>  		      GFP_KERNEL);

Ben.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper
  2014-07-24  8:47 ` [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper Alexey Kardashevskiy
@ 2014-07-28  0:46   ` Benjamin Herrenschmidt
  2014-07-28  9:12     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  0:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Additional DMA windows support is coming and accounting will include them
> so let's move this code to a helper for reuse.

This code looks a LOT like the one you added in the previous patch
(ie. kvmppc_account_memlimit()), though in a different place. I
wonder if we should re-arrange things so that there is really
only one version of it..

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 54 ++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a84788b..c8f7284 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -47,10 +47,40 @@ struct tce_container {
>  	bool enabled;
>  };
>  
> +/*
> + * Checks ulimit in order not to let the user space to pin all
> + * available memory for TCE tables.
> + */
> +static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
> +{
> +	unsigned long ret = 0, locked, lock_limit;
> +	long npages;
> +
> +	if (!current->mm)
> +		return -ESRCH; /* process exited */
> +
> +	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> +	if (!inc)
> +		npages = -npages;
> +
> +	down_write(&current->mm->mmap_sem);
> +	locked = current->mm->locked_vm + npages;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> +				rlimit(RLIMIT_MEMLOCK));
> +		ret = -ENOMEM;
> +	} else {
> +		current->mm->locked_vm += npages;
> +	}
> +	up_write(&current->mm->mmap_sem);
> +
> +	return ret;
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> -	unsigned long locked, lock_limit, npages;
>  	struct iommu_table *tbl = container->tbl;
>  
>  	if (!container->tbl)
> @@ -80,20 +110,11 @@ static int tce_iommu_enable(struct tce_container *container)
>  	 * that would effectively kill the guest at random points, much better
>  	 * enforcing the limit based on the max that the guest can map.
>  	 */
> -	down_write(&current->mm->mmap_sem);
> -	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> -	locked = current->mm->locked_vm + npages;
> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> -		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> -				rlimit(RLIMIT_MEMLOCK));
> -		ret = -ENOMEM;
> -	} else {
> +	ret = tce_iommu_account_memlimit(tbl, true);
> +	if (ret)
> +		return ret;
>  
> -		current->mm->locked_vm += npages;
> -		container->enabled = true;
> -	}
> -	up_write(&current->mm->mmap_sem);
> +	container->enabled = true;
>  
>  	return ret;
>  }
> @@ -108,10 +129,7 @@ static void tce_iommu_disable(struct tce_container *container)
>  	if (!container->tbl || !current->mm)
>  		return;
>  
> -	down_write(&current->mm->mmap_sem);
> -	current->mm->locked_vm -= (container->tbl->it_size <<
> -			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> -	up_write(&current->mm->mmap_sem);
> +	tce_iommu_account_memlimit(container->tbl, false);
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values
  2014-07-24  8:48 ` [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values Alexey Kardashevskiy
@ 2014-07-28  1:09   ` Benjamin Herrenschmidt
  2014-07-28  1:16   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  1:09 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> The tce_build/tce_build_rm callbacks are used to implement H_PUT_TCE/etc
> hypercalls. The PAPR spec does not allow to fail if the TCE is not empty.
> However we cannot just overwrite the existing TCE value with the new one
> as we still have to do page counting.
> 
> This adds an optional @old_tces return parameter. If it is not NULL,
> it must point to an array of @npages size where the callbacks will
> store old TCE values. Since tce_build receives virtual addresses,
> the old_tces array will contain virtual addresses as well.
> 
> As this patch is mechanical, no change in behaviour is expected.

You missed a few iommu's in the conversion ... pasemi, cell, ...

Ben.

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/machdep.h     |  2 ++
>  arch/powerpc/kernel/iommu.c            |  8 +++++---
>  arch/powerpc/platforms/powernv/pci.c   | 13 ++++++++-----
>  arch/powerpc/platforms/pseries/iommu.c |  7 +++++--
>  arch/powerpc/sysdev/dart_iommu.c       |  1 +
>  5 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index f92b0b5..f11596c 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -69,6 +69,7 @@ struct machdep_calls {
>  				     long index,
>  				     long npages,
>  				     unsigned long uaddr,
> +				     unsigned long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs);
>  	void		(*tce_free)(struct iommu_table *tbl,
> @@ -83,6 +84,7 @@ struct machdep_calls {
>  				     long index,
>  				     long npages,
>  				     unsigned long uaddr,
> +				     long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs);
>  	void		(*tce_free_rm)(struct iommu_table *tbl,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5af2319..ccf7510 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -324,7 +324,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
>  	/* Put the TCEs in the HW table */
>  	build_fail = ppc_md.tce_build(tbl, entry, npages,
>  				      (unsigned long)page &
> -				      IOMMU_PAGE_MASK(tbl), direction, attrs);
> +				      IOMMU_PAGE_MASK(tbl), NULL, direction,
> +				      attrs);
>  
>  	/* ppc_md.tce_build() only returns non-zero for transient errors.
>  	 * Clean up the table bitmap in this case and return
> @@ -497,7 +498,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
>  		/* Insert into HW table */
>  		build_fail = ppc_md.tce_build(tbl, entry, npages,
>  					      vaddr & IOMMU_PAGE_MASK(tbl),
> -					      direction, attrs);
> +					      NULL, direction, attrs);
>  		if(unlikely(build_fail))
>  			goto failure;
>  
> @@ -1059,7 +1060,8 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>  	oldtce = ppc_md.tce_get(tbl, entry);
>  	/* Add new entry if it is not busy */
>  	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> -		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
> +		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
> +				direction, NULL);
>  
>  	spin_unlock(&(pool->lock));
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index cc54e3b..4b764c2 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -572,7 +572,8 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
>  }
>  
>  static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> -			 unsigned long uaddr, enum dma_data_direction direction,
> +			 unsigned long uaddr, unsigned long *old_tces,
> +			 enum dma_data_direction direction,
>  			 struct dma_attrs *attrs, bool rm)
>  {
>  	u64 proto_tce;
> @@ -597,12 +598,12 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>  }
>  
>  static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> -			    unsigned long uaddr,
> +			    unsigned long uaddr, unsigned long *old_tces,
>  			    enum dma_data_direction direction,
>  			    struct dma_attrs *attrs)
>  {
> -	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
> -			false);
> +	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
> +			attrs, false);
>  }
>  
>  static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
> @@ -630,10 +631,12 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>  
>  static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
>  			    unsigned long uaddr,
> +			    long *old_tces,
>  			    enum dma_data_direction direction,
>  			    struct dma_attrs *attrs)
>  {
> -	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
> +	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
> +			attrs, true);
>  }
>  
>  static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index a047754..6c70b7c 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -82,6 +82,7 @@ static void tce_invalidate_pSeries_sw(struct iommu_table *tbl,
>  
>  static int tce_build_pSeries(struct iommu_table *tbl, long index,
>  			      long npages, unsigned long uaddr,
> +			      unsigned long *old_tces,
>  			      enum dma_data_direction direction,
>  			      struct dma_attrs *attrs)
>  {
> @@ -138,6 +139,7 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
>  
>  static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				long npages, unsigned long uaddr,
> +				unsigned long *old_tces,
>  				enum dma_data_direction direction,
>  				struct dma_attrs *attrs)
>  {
> @@ -181,6 +183,7 @@ static DEFINE_PER_CPU(__be64 *, tce_page);
>  
>  static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				     long npages, unsigned long uaddr,
> +				     unsigned long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs)
>  {
> @@ -195,7 +198,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  
>  	if (npages == 1) {
>  		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> -		                           direction, attrs);
> +					   old_tces, direction, attrs);
>  	}
>  
>  	local_irq_save(flags);	/* to protect tcep and the page behind it */
> @@ -211,7 +214,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  		if (!tcep) {
>  			local_irq_restore(flags);
>  			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> -					    direction, attrs);
> +					    old_tces, direction, attrs);
>  		}
>  		__get_cpu_var(tce_page) = tcep;
>  	}
> diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
> index 9e5353f..0d3cf7c 100644
> --- a/arch/powerpc/sysdev/dart_iommu.c
> +++ b/arch/powerpc/sysdev/dart_iommu.c
> @@ -162,6 +162,7 @@ static void dart_flush(struct iommu_table *tbl)
>  
>  static int dart_build(struct iommu_table *tbl, long index,
>  		       long npages, unsigned long uaddr,
> +		       unsigned long *old_tces,
>  		       enum dma_data_direction direction,
>  		       struct dma_attrs *attrs)
>  {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values
  2014-07-24  8:48 ` [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values Alexey Kardashevskiy
  2014-07-28  1:09   ` Benjamin Herrenschmidt
@ 2014-07-28  1:16   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  1:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> The tce_build/tce_build_rm callbacks are used to implement H_PUT_TCE/etc
> hypercalls. The PAPR spec does not allow to fail if the TCE is not empty.
> However we cannot just overwrite the existing TCE value with the new one
> as we still have to do page counting.
> 
> This adds an optional @old_tces return parameter. If it is not NULL,
> it must point to an array of @npages size where the callbacks will
> store old TCE values. Since tce_build receives virtual addresses,
> the old_tces array will contain virtual addresses as well.
> 
> As this patch is mechanical, no change in behaviour is expected.

Yeah well ... you added the new argument but you never actually populate
it. Don't do that, make that one single patch.

Also this is broken in many other ways. You don't implement it for all
iommu's which is not a reasonable interface. On the other hand,
implementing for all of them would have a significant cost since on
pseries It would mean another H-call.

So I'd suggest you actually create a separate callback, something like
tce_get_and_set() or something...

In fact, since you're going to touch each iommu interface like that you
should probably do what I suggested earlier and move those callbacks
to an iommu_ops structure per iommu.

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/machdep.h     |  2 ++
>  arch/powerpc/kernel/iommu.c            |  8 +++++---
>  arch/powerpc/platforms/powernv/pci.c   | 13 ++++++++-----
>  arch/powerpc/platforms/pseries/iommu.c |  7 +++++--
>  arch/powerpc/sysdev/dart_iommu.c       |  1 +
>  5 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index f92b0b5..f11596c 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -69,6 +69,7 @@ struct machdep_calls {
>  				     long index,
>  				     long npages,
>  				     unsigned long uaddr,
> +				     unsigned long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs);
>  	void		(*tce_free)(struct iommu_table *tbl,
> @@ -83,6 +84,7 @@ struct machdep_calls {
>  				     long index,
>  				     long npages,
>  				     unsigned long uaddr,
> +				     long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs);
>  	void		(*tce_free_rm)(struct iommu_table *tbl,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5af2319..ccf7510 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -324,7 +324,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
>  	/* Put the TCEs in the HW table */
>  	build_fail = ppc_md.tce_build(tbl, entry, npages,
>  				      (unsigned long)page &
> -				      IOMMU_PAGE_MASK(tbl), direction, attrs);
> +				      IOMMU_PAGE_MASK(tbl), NULL, direction,
> +				      attrs);
>  
>  	/* ppc_md.tce_build() only returns non-zero for transient errors.
>  	 * Clean up the table bitmap in this case and return
> @@ -497,7 +498,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
>  		/* Insert into HW table */
>  		build_fail = ppc_md.tce_build(tbl, entry, npages,
>  					      vaddr & IOMMU_PAGE_MASK(tbl),
> -					      direction, attrs);
> +					      NULL, direction, attrs);
>  		if(unlikely(build_fail))
>  			goto failure;
>  
> @@ -1059,7 +1060,8 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
>  	oldtce = ppc_md.tce_get(tbl, entry);
>  	/* Add new entry if it is not busy */
>  	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> -		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
> +		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
> +				direction, NULL);
>  
>  	spin_unlock(&(pool->lock));
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index cc54e3b..4b764c2 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -572,7 +572,8 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
>  }
>  
>  static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> -			 unsigned long uaddr, enum dma_data_direction direction,
> +			 unsigned long uaddr, unsigned long *old_tces,
> +			 enum dma_data_direction direction,
>  			 struct dma_attrs *attrs, bool rm)
>  {
>  	u64 proto_tce;
> @@ -597,12 +598,12 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>  }
>  
>  static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> -			    unsigned long uaddr,
> +			    unsigned long uaddr, unsigned long *old_tces,
>  			    enum dma_data_direction direction,
>  			    struct dma_attrs *attrs)
>  {
> -	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
> -			false);
> +	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
> +			attrs, false);
>  }
>  
>  static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
> @@ -630,10 +631,12 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>  
>  static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
>  			    unsigned long uaddr,
> +			    long *old_tces,
>  			    enum dma_data_direction direction,
>  			    struct dma_attrs *attrs)
>  {
> -	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
> +	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
> +			attrs, true);
>  }
>  
>  static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index a047754..6c70b7c 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -82,6 +82,7 @@ static void tce_invalidate_pSeries_sw(struct iommu_table *tbl,
>  
>  static int tce_build_pSeries(struct iommu_table *tbl, long index,
>  			      long npages, unsigned long uaddr,
> +			      unsigned long *old_tces,
>  			      enum dma_data_direction direction,
>  			      struct dma_attrs *attrs)
>  {
> @@ -138,6 +139,7 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
>  
>  static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				long npages, unsigned long uaddr,
> +				unsigned long *old_tces,
>  				enum dma_data_direction direction,
>  				struct dma_attrs *attrs)
>  {
> @@ -181,6 +183,7 @@ static DEFINE_PER_CPU(__be64 *, tce_page);
>  
>  static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				     long npages, unsigned long uaddr,
> +				     unsigned long *old_tces,
>  				     enum dma_data_direction direction,
>  				     struct dma_attrs *attrs)
>  {
> @@ -195,7 +198,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  
>  	if (npages == 1) {
>  		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> -		                           direction, attrs);
> +					   old_tces, direction, attrs);
>  	}
>  
>  	local_irq_save(flags);	/* to protect tcep and the page behind it */
> @@ -211,7 +214,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  		if (!tcep) {
>  			local_irq_restore(flags);
>  			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> -					    direction, attrs);
> +					    old_tces, direction, attrs);
>  		}
>  		__get_cpu_var(tce_page) = tcep;
>  	}
> diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
> index 9e5353f..0d3cf7c 100644
> --- a/arch/powerpc/sysdev/dart_iommu.c
> +++ b/arch/powerpc/sysdev/dart_iommu.c
> @@ -162,6 +162,7 @@ static void dart_flush(struct iommu_table *tbl)
>  
>  static int dart_build(struct iommu_table *tbl, long index,
>  		       long npages, unsigned long uaddr,
> +		       unsigned long *old_tces,
>  		       enum dma_data_direction direction,
>  		       struct dma_attrs *attrs)
>  {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-07-24  8:48 ` [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
@ 2014-07-28  1:18   ` Benjamin Herrenschmidt
  2014-07-28  3:55     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  1:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> At the moment the iommu_table struct has a set_bypass() which enables/
> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> which calls this callback when external IOMMU users such as VFIO are
> about to get over a PHB.
> 
> Since the set_bypass() is not really an iommu_table function but PE's
> function, and we have an ops struct per IOMMU owner, let's move
> set_bypass() to the spapr_tce_iommu_ops struct.
> 
> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
> has very little to do with PEs, this moves take_ownership() calls to
> the VFIO SPAPR TCE driver.
> 
> This renames set_bypass() to take_ownership() as it is not necessarily
> just enabling bypassing, it can be something else/more so let's give it
> a generic name. The bool parameter is inverted.

I disagree with the name change. take_ownership() is the right semantic
at the high level, but down to the actual table, it *is* about disabling
bypass.

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/iommu.h          |  1 -
>  arch/powerpc/include/asm/tce.h            |  2 ++
>  arch/powerpc/kernel/iommu.c               | 12 ------------
>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++++++++++-------
>  drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
>  5 files changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 84ee339..2b0b01d 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -77,7 +77,6 @@ struct iommu_table {
>  #ifdef CONFIG_IOMMU_API
>  	struct iommu_group *it_group;
>  #endif
> -	void (*set_bypass)(struct iommu_table *tbl, bool enable);
>  };
>  
>  /* Pure 2^n version of get_order */
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index 8bfe98f..5ee4987 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
>  	struct iommu_table *(*get_table)(
>  			struct spapr_tce_iommu_group *data,
>  			phys_addr_t addr);
> +	void (*take_ownership)(struct spapr_tce_iommu_group *data,
> +			bool enable);
>  };
>  
>  struct spapr_tce_iommu_group {
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index e203314..06984d5 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>  	memset(tbl->it_map, 0xff, sz);
>  	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
>  
> -	/*
> -	 * Disable iommu bypass, otherwise the user can DMA to all of
> -	 * our physical memory via the bypass window instead of just
> -	 * the pages that has been explicitly mapped into the iommu
> -	 */
> -	if (tbl->set_bypass)
> -		tbl->set_bypass(tbl, false);
> -
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>  	/* Restore bit#0 set by iommu_init_table() */
>  	if (tbl->it_offset == 0)
>  		set_bit(0, tbl->it_map);
> -
> -	/* The kernel owns the device now, we can restore the iommu bypass */
> -	if (tbl->set_bypass)
> -		tbl->set_bypass(tbl, true);
>  }
>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 495137b..f828c57 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>  		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>  }
>  
> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>  {
> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> -					      tce32.table);
>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
>  	int64_t rc;
>  
> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>  	/* TVE #1 is selected by PCI address bit 59 */
>  	pe->tce_bypass_base = 1ull << 59;
>  
> -	/* Install set_bypass callback for VFIO */
> -	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
> -
>  	/* Enable bypass by default */
> -	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
> +	pnv_pci_ioda2_set_bypass(pe, true);
> +}
> +
> +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
> +				     bool enable)
> +{
> +	struct pnv_ioda_pe *pe = data->iommu_owner;
> +
> +	pnv_pci_ioda2_set_bypass(pe, !enable);
>  }
>  
>  static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
>  	.get_table = pnv_ioda1_iommu_get_table,
> +	.take_ownership = pnv_ioda2_take_ownership,
>  };
>  
>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 917c854..05b2f11 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -78,6 +78,13 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
>  	return ret;
>  }
>  
> +static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
> +		bool enable)
> +{
> +	if (data && data->ops && data->ops->take_ownership)
> +		data->ops->take_ownership(data, enable);
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>  	int ret = 0;
> @@ -386,6 +393,12 @@ static int tce_iommu_attach_group(void *iommu_data,
>  		ret = iommu_take_ownership(tbl);
>  		if (!ret)
>  			container->grp = iommu_group;
> +		/*
> +		 * Disable iommu bypass, otherwise the user can DMA to all of
> +		 * our physical memory via the bypass window instead of just
> +		 * the pages that has been explicitly mapped into the iommu
> +		 */
> +		tce_iommu_take_ownership_notify(data, true);
>  	}
>  
>  	mutex_unlock(&container->lock);
> @@ -423,6 +436,9 @@ static void tce_iommu_detach_group(void *iommu_data,
>  		BUG_ON(!tbl);
>  
>  		iommu_release_ownership(tbl);
> +
> +		/* Kernel owns the device now, we can restore bypass */
> +		tce_iommu_take_ownership_notify(data, false);
>  	}
>  	mutex_unlock(&container->lock);
>  }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()
  2014-07-24  8:48 ` [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode() Alexey Kardashevskiy
@ 2014-07-28  1:19   ` Benjamin Herrenschmidt
  2014-07-28  4:32     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  1:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> This adds missing permission bits to the translated TCE.

Is this a bug fix for existing stuff ? If yes, submit it separately.

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  arch/powerpc/kernel/iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 0cda2e8..5af2319 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
>  		return -EFAULT;
>  	}
>  	hwaddr = (unsigned long) page_address(page) + offset;
> +	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
>  
>  	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
>  	if (ret)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-07-28  1:18   ` Benjamin Herrenschmidt
@ 2014-07-28  3:55     ` Alexey Kardashevskiy
  2014-07-28  4:19       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-28  3:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Paul Mackerras, Gavin Shan

On 07/28/2014 11:18 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> At the moment the iommu_table struct has a set_bypass() which enables/
>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>> which calls this callback when external IOMMU users such as VFIO are
>> about to get over a PHB.
>>
>> Since the set_bypass() is not really an iommu_table function but PE's
>> function, and we have an ops struct per IOMMU owner, let's move
>> set_bypass() to the spapr_tce_iommu_ops struct.
>>
>> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
>> has very little to do with PEs, this moves take_ownership() calls to
>> the VFIO SPAPR TCE driver.
>>
>> This renames set_bypass() to take_ownership() as it is not necessarily
>> just enabling bypassing, it can be something else/more so let's give it
>> a generic name. The bool parameter is inverted.
> 
> I disagree with the name change. take_ownership() is the right semantic
> at the high level, but down to the actual table, it *is* about disabling
> bypass.

It is still pnv_pci_ioda2_set_bypass() :-/

take_ownership() is a callback of PNV IOMMU group.

s/take_ownership/set_bypass/ ?


> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/iommu.h          |  1 -
>>  arch/powerpc/include/asm/tce.h            |  2 ++
>>  arch/powerpc/kernel/iommu.c               | 12 ------------
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++++++++++-------
>>  drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
>>  5 files changed, 29 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index 84ee339..2b0b01d 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -77,7 +77,6 @@ struct iommu_table {
>>  #ifdef CONFIG_IOMMU_API
>>  	struct iommu_group *it_group;
>>  #endif
>> -	void (*set_bypass)(struct iommu_table *tbl, bool enable);
>>  };
>>  
>>  /* Pure 2^n version of get_order */
>> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
>> index 8bfe98f..5ee4987 100644
>> --- a/arch/powerpc/include/asm/tce.h
>> +++ b/arch/powerpc/include/asm/tce.h
>> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
>>  	struct iommu_table *(*get_table)(
>>  			struct spapr_tce_iommu_group *data,
>>  			phys_addr_t addr);
>> +	void (*take_ownership)(struct spapr_tce_iommu_group *data,
>> +			bool enable);
>>  };
>>  
>>  struct spapr_tce_iommu_group {
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index e203314..06984d5 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>>  	memset(tbl->it_map, 0xff, sz);
>>  	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
>>  
>> -	/*
>> -	 * Disable iommu bypass, otherwise the user can DMA to all of
>> -	 * our physical memory via the bypass window instead of just
>> -	 * the pages that has been explicitly mapped into the iommu
>> -	 */
>> -	if (tbl->set_bypass)
>> -		tbl->set_bypass(tbl, false);
>> -
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
>> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>>  	/* Restore bit#0 set by iommu_init_table() */
>>  	if (tbl->it_offset == 0)
>>  		set_bit(0, tbl->it_map);
>> -
>> -	/* The kernel owns the device now, we can restore the iommu bypass */
>> -	if (tbl->set_bypass)
>> -		tbl->set_bypass(tbl, true);
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>>  
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 495137b..f828c57 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
>>  		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>>  }
>>  
>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>>  {
>> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> -					      tce32.table);
>>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
>>  	int64_t rc;
>>  
>> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>  	/* TVE #1 is selected by PCI address bit 59 */
>>  	pe->tce_bypass_base = 1ull << 59;
>>  
>> -	/* Install set_bypass callback for VFIO */
>> -	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
>> -
>>  	/* Enable bypass by default */
>> -	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
>> +	pnv_pci_ioda2_set_bypass(pe, true);
>> +}
>> +
>> +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
>> +				     bool enable)
>> +{
>> +	struct pnv_ioda_pe *pe = data->iommu_owner;
>> +
>> +	pnv_pci_ioda2_set_bypass(pe, !enable);
>>  }
>>  
>>  static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
>>  	.get_table = pnv_ioda1_iommu_get_table,
>> +	.take_ownership = pnv_ioda2_take_ownership,
>>  };
>>  
>>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 917c854..05b2f11 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -78,6 +78,13 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
>>  	return ret;
>>  }
>>  
>> +static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
>> +		bool enable)
>> +{
>> +	if (data && data->ops && data->ops->take_ownership)
>> +		data->ops->take_ownership(data, enable);
>> +}
>> +
>>  static int tce_iommu_enable(struct tce_container *container)
>>  {
>>  	int ret = 0;
>> @@ -386,6 +393,12 @@ static int tce_iommu_attach_group(void *iommu_data,
>>  		ret = iommu_take_ownership(tbl);
>>  		if (!ret)
>>  			container->grp = iommu_group;
>> +		/*
>> +		 * Disable iommu bypass, otherwise the user can DMA to all of
>> +		 * our physical memory via the bypass window instead of just
>> +		 * the pages that has been explicitly mapped into the iommu
>> +		 */
>> +		tce_iommu_take_ownership_notify(data, true);
>>  	}
>>  
>>  	mutex_unlock(&container->lock);
>> @@ -423,6 +436,9 @@ static void tce_iommu_detach_group(void *iommu_data,
>>  		BUG_ON(!tbl);
>>  
>>  		iommu_release_ownership(tbl);
>> +
>> +		/* Kernel owns the device now, we can restore bypass */
>> +		tce_iommu_take_ownership_notify(data, false);
>>  	}
>>  	mutex_unlock(&container->lock);
>>  }
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables
  2014-07-28  0:40   ` Benjamin Herrenschmidt
@ 2014-07-28  4:11     ` Alexey Kardashevskiy
  2014-07-28  4:30       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-28  4:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On 07/28/2014 10:40 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> 
> You need a much better changeset here, explaining what you are
> protecting against and why just changing that one list_add is
> sufficient.



===
At the moment the spapr_tce_tables list is not protected against races
which may happen if the userspace issues the KVM_CREATE_SPAPR_TCE ioctl
to KVM from different threads.

This makes use of _rcu helpers for list_add()/list_del().
===

and it is missing this bit as well (was under impression that all tables
get freed at once at KVM exit but this might not be the case for malicious
guest):

--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -211,7 +211,7 @@ static void release_spapr_tce_table(struct
kvmppc_spapr_tce_table *stt)
        long npages = kvmppc_stt_npages(stt->size);

        mutex_lock(&kvm->lock);
-       list_del(&stt->list);
+       list_del_rcu(&stt->list);



Would this be sufficient?



> 
> Cheers,
> Ben.
> 
>> ---
>>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 54cf9bc..516f2ee 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	kvm_get_kvm(kvm);
>>  
>>  	mutex_lock(&kvm->lock);
>> -	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
>> +	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
>>  
>>  	mutex_unlock(&kvm->lock);
>>  
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
  2014-07-28  3:55     ` Alexey Kardashevskiy
@ 2014-07-28  4:19       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  4:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Paul Mackerras, Gavin Shan

On Mon, 2014-07-28 at 13:55 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 11:18 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> >> At the moment the iommu_table struct has a set_bypass() which enables/
> >> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >> which calls this callback when external IOMMU users such as VFIO are
> >> about to get over a PHB.
> >>
> >> Since the set_bypass() is not really an iommu_table function but PE's
> >> function, and we have an ops struct per IOMMU owner, let's move
> >> set_bypass() to the spapr_tce_iommu_ops struct.
> >>
> >> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
> >> has very little to do with PEs, this moves take_ownership() calls to
> >> the VFIO SPAPR TCE driver.
> >>
> >> This renames set_bypass() to take_ownership() as it is not necessarily
> >> just enabling bypassing, it can be something else/more so let's give it
> >> a generic name. The bool parameter is inverted.
> > 
> > I disagree with the name change. take_ownership() is the right semantic
> > at the high level, but down to the actual table, it *is* about disabling
> > bypass.
> 
> It is still pnv_pci_ioda2_set_bypass() :-/
> 
> take_ownership() is a callback of PNV IOMMU group.
> 
> s/take_ownership/set_bypass/ ?
> 

Hrm, ... ok. It's a bit messy but should do for now.

> > 
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/include/asm/iommu.h          |  1 -
> >>  arch/powerpc/include/asm/tce.h            |  2 ++
> >>  arch/powerpc/kernel/iommu.c               | 12 ------------
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++++++++++-------
> >>  drivers/vfio/vfio_iommu_spapr_tce.c       | 16 ++++++++++++++++
> >>  5 files changed, 29 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >> index 84ee339..2b0b01d 100644
> >> --- a/arch/powerpc/include/asm/iommu.h
> >> +++ b/arch/powerpc/include/asm/iommu.h
> >> @@ -77,7 +77,6 @@ struct iommu_table {
> >>  #ifdef CONFIG_IOMMU_API
> >>  	struct iommu_group *it_group;
> >>  #endif
> >> -	void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >>  };
> >>  
> >>  /* Pure 2^n version of get_order */
> >> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> >> index 8bfe98f..5ee4987 100644
> >> --- a/arch/powerpc/include/asm/tce.h
> >> +++ b/arch/powerpc/include/asm/tce.h
> >> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
> >>  	struct iommu_table *(*get_table)(
> >>  			struct spapr_tce_iommu_group *data,
> >>  			phys_addr_t addr);
> >> +	void (*take_ownership)(struct spapr_tce_iommu_group *data,
> >> +			bool enable);
> >>  };
> >>  
> >>  struct spapr_tce_iommu_group {
> >> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >> index e203314..06984d5 100644
> >> --- a/arch/powerpc/kernel/iommu.c
> >> +++ b/arch/powerpc/kernel/iommu.c
> >> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
> >>  	memset(tbl->it_map, 0xff, sz);
> >>  	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
> >>  
> >> -	/*
> >> -	 * Disable iommu bypass, otherwise the user can DMA to all of
> >> -	 * our physical memory via the bypass window instead of just
> >> -	 * the pages that has been explicitly mapped into the iommu
> >> -	 */
> >> -	if (tbl->set_bypass)
> >> -		tbl->set_bypass(tbl, false);
> >> -
> >>  	return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
> >> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
> >>  	/* Restore bit#0 set by iommu_init_table() */
> >>  	if (tbl->it_offset == 0)
> >>  		set_bit(0, tbl->it_map);
> >> -
> >> -	/* The kernel owns the device now, we can restore the iommu bypass */
> >> -	if (tbl->set_bypass)
> >> -		tbl->set_bypass(tbl, true);
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
> >>  
> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> index 495137b..f828c57 100644
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
> >>  		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >>  }
> >>  
> >> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >>  {
> >> -	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> >> -					      tce32.table);
> >>  	uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >>  	int64_t rc;
> >>  
> >> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >>  	/* TVE #1 is selected by PCI address bit 59 */
> >>  	pe->tce_bypass_base = 1ull << 59;
> >>  
> >> -	/* Install set_bypass callback for VFIO */
> >> -	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
> >> -
> >>  	/* Enable bypass by default */
> >> -	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
> >> +	pnv_pci_ioda2_set_bypass(pe, true);
> >> +}
> >> +
> >> +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
> >> +				     bool enable)
> >> +{
> >> +	struct pnv_ioda_pe *pe = data->iommu_owner;
> >> +
> >> +	pnv_pci_ioda2_set_bypass(pe, !enable);
> >>  }
> >>  
> >>  static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
> >>  	.get_table = pnv_ioda1_iommu_get_table,
> >> +	.take_ownership = pnv_ioda2_take_ownership,
> >>  };
> >>  
> >>  static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index 917c854..05b2f11 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -78,6 +78,13 @@ static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
> >>  	return ret;
> >>  }
> >>  
> >> +static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
> >> +		bool enable)
> >> +{
> >> +	if (data && data->ops && data->ops->take_ownership)
> >> +		data->ops->take_ownership(data, enable);
> >> +}
> >> +
> >>  static int tce_iommu_enable(struct tce_container *container)
> >>  {
> >>  	int ret = 0;
> >> @@ -386,6 +393,12 @@ static int tce_iommu_attach_group(void *iommu_data,
> >>  		ret = iommu_take_ownership(tbl);
> >>  		if (!ret)
> >>  			container->grp = iommu_group;
> >> +		/*
> >> +		 * Disable iommu bypass, otherwise the user can DMA to all of
> >> +		 * our physical memory via the bypass window instead of just
> >> +		 * the pages that has been explicitly mapped into the iommu
> >> +		 */
> >> +		tce_iommu_take_ownership_notify(data, true);
> >>  	}
> >>  
> >>  	mutex_unlock(&container->lock);
> >> @@ -423,6 +436,9 @@ static void tce_iommu_detach_group(void *iommu_data,
> >>  		BUG_ON(!tbl);
> >>  
> >>  		iommu_release_ownership(tbl);
> >> +
> >> +		/* Kernel owns the device now, we can restore bypass */
> >> +		tce_iommu_take_ownership_notify(data, false);
> >>  	}
> >>  	mutex_unlock(&container->lock);
> >>  }
> > 
> > 
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm
  2014-07-28  0:43   ` Benjamin Herrenschmidt
@ 2014-07-28  4:23     ` Alexey Kardashevskiy
  2014-07-28  4:34       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-28  4:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On 07/28/2014 10:43 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
> 
> You need a description.
> 
>>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++++++++++++++++++++++++++++++++++-
>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
>> index 516f2ee..48b7ed4 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
>>  		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>>  }
>>  
>> +/*
>> + * Checks ulimit in order not to let the user space to pin all
>> + * available memory for TCE tables.
>> + */
>> +static long kvmppc_account_memlimit(long npages)
>> +{
>> +	unsigned long ret = 0, locked, lock_limit;
>> +
>> +	if (!current->mm)
>> +		return -ESRCH; /* process exited */
>> +
>> +	down_write(&current->mm->mmap_sem);
>> +	locked = current->mm->locked_vm + npages;
>> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> +		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +				rlimit(RLIMIT_MEMLOCK));
>> +		ret = -ENOMEM;
>> +	} else {
>> +		current->mm->locked_vm += npages;
>> +	}
>> +	up_write(&current->mm->mmap_sem);
>> +
>> +	return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>>  {
>>  	struct kvm *kvm = stt->kvm;
>>  	int i;
>> +	long npages = kvmppc_stt_npages(stt->window_size);
>>  
>>  	mutex_lock(&kvm->lock);
>>  	list_del(&stt->list);
>> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
>> +	for (i = 0; i < npages; i++)
>>  		__free_page(stt->pages[i]);
>> +
>>  	kfree(stt);
>>  	mutex_unlock(&kvm->lock);
>>  
>> +	kvmppc_account_memlimit(-(npages + 1));
>> +
>>  	kvm_put_kvm(kvm);
>>  }
>>  
>> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  	}
>>  
>>  	npages = kvmppc_stt_npages(args->window_size);
>> +	ret = kvmppc_account_memlimit(npages + 1);
>> +	if (ret)
>> +		goto fail;
> 
> This is called for VFIO only or is it also called when creating TCE
> tables for emulated devices ? Because in the latter case, you don't
> want to account the pages as locked, do you ?

At the moment TCE-containing pages (for emulated TCE) are allocated with
alloc_page() which is kernel memory and therefore always locked, no?


> Also, you need to explain what +1
> 
> Finally, do I correctly deduce that creating 10 TCE tables of 2G
> each will end up accounting 20G as locked even if the guest for
> example only has 4G of RAM ? 


The user is required to set the limit to 20G, correct. But this does not
mean all 20G will be pinned. Ugly but better than nothing. As I remember
from you explanations, even if we give up real/virtual mode handlers for
H_PUT_TCE&Co, we cannot rely of existing counters in page struct in order
to understand whether we need to account a page again or not so we are
stuck with this code till we have a "clone DDW window" API.


But this patch is not about guest pages, it is about pages with TCEs, there
was no counting for this at all.



> 
>>  	stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
>>  		      GFP_KERNEL);
> 
> Ben.
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables
  2014-07-28  4:11     ` Alexey Kardashevskiy
@ 2014-07-28  4:30       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  4:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Mon, 2014-07-28 at 14:11 +1000, Alexey Kardashevskiy wrote:
> ===
> At the moment the spapr_tce_tables list is not protected against races
> which may happen if the userspace issues the KVM_CREATE_SPAPR_TCE ioctl
> to KVM from different threads.
> 
> This makes use of _rcu helpers for list_add()/list_del().
> ===
> 
> and it is missing this bit as well (was under impression that all tables
> get freed at once at KVM exit but this might not be the case for malicious
> guest):
> 
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -211,7 +211,7 @@ static void release_spapr_tce_table(struct
> kvmppc_spapr_tce_table *stt)
>         long npages = kvmppc_stt_npages(stt->size);
> 
>         mutex_lock(&kvm->lock);
> -       list_del(&stt->list);
> +       list_del_rcu(&stt->list);
> 
> 
> 
> Would this be sufficient?

Well, not really, we'd need to see the corresponding RCU'isms when
walking the list including the rcu read lock ... which could be
problematic if we are in real mode....

Cheers,
Ben.


> 
> 
> > 
> > Cheers,
> > Ben.
> > 
> >> ---
> >>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 54cf9bc..516f2ee 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	kvm_get_kvm(kvm);
> >>  
> >>  	mutex_lock(&kvm->lock);
> >> -	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> >> +	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
> >>  
> >>  	mutex_unlock(&kvm->lock);
> >>  
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()
  2014-07-28  1:19   ` Benjamin Herrenschmidt
@ 2014-07-28  4:32     ` Alexey Kardashevskiy
  2014-07-28  4:35       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-28  4:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On 07/28/2014 11:19 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> This adds missing permission bits to the translated TCE.
> 
> Is this a bug fix for existing stuff ? If yes, submit it separately.


There is 15/18 patch which fixes possible bug with leaking pages, and that
patch won't work until this one is applied.

Merge this one and "[PATCH v3 15/18] powerpc/iommu: Implement put_page() if
TCE had non-zero value"?




> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  arch/powerpc/kernel/iommu.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 0cda2e8..5af2319 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
>>  		return -EFAULT;
>>  	}
>>  	hwaddr = (unsigned long) page_address(page) + offset;
>> +	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
>>  
>>  	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
>>  	if (ret)
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm
  2014-07-28  4:23     ` Alexey Kardashevskiy
@ 2014-07-28  4:34       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  4:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Mon, 2014-07-28 at 14:23 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 10:43 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> > 
> > You need a description.
> > 
> >>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 34 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 516f2ee..48b7ed4 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
> >>  		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
> >>  }
> >>  
> >> +/*
> >> + * Checks ulimit in order not to let the user space to pin all
> >> + * available memory for TCE tables.
> >> + */
> >> +static long kvmppc_account_memlimit(long npages)
> >> +{
> >> +	unsigned long ret = 0, locked, lock_limit;
> >> +
> >> +	if (!current->mm)
> >> +		return -ESRCH; /* process exited */
> >> +
> >> +	down_write(&current->mm->mmap_sem);
> >> +	locked = current->mm->locked_vm + npages;
> >> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> >> +		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +				rlimit(RLIMIT_MEMLOCK));
> >> +		ret = -ENOMEM;
> >> +	} else {
> >> +		current->mm->locked_vm += npages;
> >> +	}
> >> +	up_write(&current->mm->mmap_sem);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
> >>  {
> >>  	struct kvm *kvm = stt->kvm;
> >>  	int i;
> >> +	long npages = kvmppc_stt_npages(stt->window_size);
> >>  
> >>  	mutex_lock(&kvm->lock);
> >>  	list_del(&stt->list);
> >> -	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> >> +	for (i = 0; i < npages; i++)
> >>  		__free_page(stt->pages[i]);
> >> +
> >>  	kfree(stt);
> >>  	mutex_unlock(&kvm->lock);
> >>  
> >> +	kvmppc_account_memlimit(-(npages + 1));
> >> +
> >>  	kvm_put_kvm(kvm);
> >>  }
> >>  
> >> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>  	}
> >>  
> >>  	npages = kvmppc_stt_npages(args->window_size);
> >> +	ret = kvmppc_account_memlimit(npages + 1);
> >> +	if (ret)
> >> +		goto fail;
> > 
> > This is called for VFIO only or is it also called when creating TCE
> > tables for emulated devices ? Because in the latter case, you don't
> > want to account the pages as locked, do you ?
> 
> At the moment TCE-containing pages (for emulated TCE) are allocated with
> alloc_page() which is kernel memory and therefore always locked, no?

So the npages up there is the number of TCE-containing pages, not the
number of mapped-by-TCE pages ? In that case it makes sense yes.

> 
> > Also, you need to explain what +1
> > 
> > Finally, do I correctly deduce that creating 10 TCE tables of 2G
> > each will end up accounting 20G as locked even if the guest for
> > example only has 4G of RAM ? 
> 
> 
> The user is required to set the limit to 20G, correct. But this does not
> mean all 20G will be pinned. Ugly but better than nothing. As I remember
> from you explanations, even if we give up real/virtual mode handlers for
> H_PUT_TCE&Co, we cannot rely of existing counters in page struct in order
> to understand whether we need to account a page again or not so we are
> stuck with this code till we have a "clone DDW window" API.

Right but please put that explanation somewhere in one of the changeset
comments or as comments near the code.

> But this patch is not about guest pages, it is about pages with TCEs, there
> was no counting for this at all.

Ok.

> 
> 
> > 
> >>  	stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
> >>  		      GFP_KERNEL);
> > 
> > Ben.
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()
  2014-07-28  4:32     ` Alexey Kardashevskiy
@ 2014-07-28  4:35       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2014-07-28  4:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On Mon, 2014-07-28 at 14:32 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 11:19 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> >> This adds missing permission bits to the translated TCE.
> > 
> > Is this a bug fix for existing stuff ? If yes, submit it separately.
> 
> 
> There is 15/18 patch which fixes possible bug with leaking pages, and that
> patch won't work until this one is applied.

Please collapse them then.

> Merge this one and "[PATCH v3 15/18] powerpc/iommu: Implement put_page() if
> TCE had non-zero value"?

Right, and if the result is a bug fix on top of something already
upstream, please send it separately.

Cheers,
Ben.

> 
> 
> > 
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>  arch/powerpc/kernel/iommu.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >> index 0cda2e8..5af2319 100644
> >> --- a/arch/powerpc/kernel/iommu.c
> >> +++ b/arch/powerpc/kernel/iommu.c
> >> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
> >>  		return -EFAULT;
> >>  	}
> >>  	hwaddr = (unsigned long) page_address(page) + offset;
> >> +	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
> >>  
> >>  	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
> >>  	if (ret)
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper
  2014-07-28  0:46   ` Benjamin Herrenschmidt
@ 2014-07-28  9:12     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 35+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-28  9:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev, Gavin Shan

On 07/28/2014 10:46 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
>> Additional DMA windows support is coming and accounting will include them
>> so let's move this code to a helper for reuse.
> 
> This code looks a LOT like the one you added in the previous patch
> (ie. kvmppc_account_memlimit()), though in a different place. I
> wonder if we should re-arrange things so that there is really
> only one version of it..


One copy is in VFIO driver, another in KVM, they meet in VFIO KVM device
(which may miss) or IOMMU code (drivers/iommu/iommu.c or
arch/powerpc/kernel/iommu.c) but this math has nothing to do with IOMMU. I
fail to see good common place for this.


> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 54 ++++++++++++++++++++++++-------------
>>  1 file changed, 36 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index a84788b..c8f7284 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -47,10 +47,40 @@ struct tce_container {
>>  	bool enabled;
>>  };
>>  
>> +/*
>> + * Checks ulimit in order not to let the user space to pin all
>> + * available memory for TCE tables.
>> + */
>> +static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
>> +{
>> +	unsigned long ret = 0, locked, lock_limit;
>> +	long npages;
>> +
>> +	if (!current->mm)
>> +		return -ESRCH; /* process exited */
>> +
>> +	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>> +	if (!inc)
>> +		npages = -npages;
>> +
>> +	down_write(&current->mm->mmap_sem);
>> +	locked = current->mm->locked_vm + npages;
>> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> +		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +				rlimit(RLIMIT_MEMLOCK));
>> +		ret = -ENOMEM;
>> +	} else {
>> +		current->mm->locked_vm += npages;
>> +	}
>> +	up_write(&current->mm->mmap_sem);
>> +
>> +	return ret;
>> +}
>> +
>>  static int tce_iommu_enable(struct tce_container *container)
>>  {
>>  	int ret = 0;
>> -	unsigned long locked, lock_limit, npages;
>>  	struct iommu_table *tbl = container->tbl;
>>  
>>  	if (!container->tbl)
>> @@ -80,20 +110,11 @@ static int tce_iommu_enable(struct tce_container *container)
>>  	 * that would effectively kill the guest at random points, much better
>>  	 * enforcing the limit based on the max that the guest can map.
>>  	 */
>> -	down_write(&current->mm->mmap_sem);
>> -	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>> -	locked = current->mm->locked_vm + npages;
>> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> -		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> -				rlimit(RLIMIT_MEMLOCK));
>> -		ret = -ENOMEM;
>> -	} else {
>> +	ret = tce_iommu_account_memlimit(tbl, true);
>> +	if (ret)
>> +		return ret;
>>  
>> -		current->mm->locked_vm += npages;
>> -		container->enabled = true;
>> -	}
>> -	up_write(&current->mm->mmap_sem);
>> +	container->enabled = true;
>>  
>>  	return ret;
>>  }
>> @@ -108,10 +129,7 @@ static void tce_iommu_disable(struct tce_container *container)
>>  	if (!container->tbl || !current->mm)
>>  		return;
>>  
>> -	down_write(&current->mm->mmap_sem);
>> -	current->mm->locked_vm -= (container->tbl->it_size <<
>> -			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
>> -	up_write(&current->mm->mmap_sem);
>> +	tce_iommu_account_memlimit(container->tbl, false);
>>  }
>>  
>>  static void *tce_iommu_open(unsigned long arg)
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-07-28  9:12 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-24  8:47 [PATCH v3 00/18] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
2014-07-24  8:47 ` [PATCH v3 01/18] powerpc/iommu: Fix comments with it_page_shift Alexey Kardashevskiy
2014-07-24  8:47 ` [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables Alexey Kardashevskiy
2014-07-28  0:40   ` Benjamin Herrenschmidt
2014-07-28  4:11     ` Alexey Kardashevskiy
2014-07-28  4:30       ` Benjamin Herrenschmidt
2014-07-24  8:47 ` [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm Alexey Kardashevskiy
2014-07-28  0:43   ` Benjamin Herrenschmidt
2014-07-28  4:23     ` Alexey Kardashevskiy
2014-07-28  4:34       ` Benjamin Herrenschmidt
2014-07-24  8:47 ` [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper Alexey Kardashevskiy
2014-07-28  0:46   ` Benjamin Herrenschmidt
2014-07-28  9:12     ` Alexey Kardashevskiy
2014-07-24  8:47 ` [PATCH v3 05/18] powerpc/powernv: Use it_page_shift for TCE invalidation Alexey Kardashevskiy
2014-07-24  8:47 ` [PATCH v3 06/18] powerpc/powernv: Use it_page_shift in TCE build Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 07/18] powerpc/powernv: Add a page size parameter to pnv_pci_setup_iommu_table() Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 08/18] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 09/18] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
2014-07-28  1:18   ` Benjamin Herrenschmidt
2014-07-28  3:55     ` Alexey Kardashevskiy
2014-07-28  4:19       ` Benjamin Herrenschmidt
2014-07-24  8:48 ` [PATCH v3 11/18] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode() Alexey Kardashevskiy
2014-07-28  1:19   ` Benjamin Herrenschmidt
2014-07-28  4:32     ` Alexey Kardashevskiy
2014-07-28  4:35       ` Benjamin Herrenschmidt
2014-07-24  8:48 ` [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values Alexey Kardashevskiy
2014-07-28  1:09   ` Benjamin Herrenschmidt
2014-07-28  1:16   ` Benjamin Herrenschmidt
2014-07-24  8:48 ` [PATCH v3 14/18] powerpc/powernv: Return non-zero TCE from pnv_tce_build Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 15/18] powerpc/iommu: Implement put_page() if TCE had non-zero value Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 16/18] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 17/18] vfio: Use it_page_size Alexey Kardashevskiy
2014-07-24  8:48 ` [PATCH v3 18/18] vfio: powerpc: Enable Dynamic DMA windows Alexey Kardashevskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).