linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW)
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This prepares the existing upstream kernel for DDW (Dynamic DMA windows) and
adds actual DDW support to VFIO.

This patchset does not contain any in-kernel acceleration support.

This patchset does not enable DDW for emulated devices.


Changes:
v4:
* addressed Ben's comments
* big rework: moved the tce_xxx callbacks out of ppc_md

v3:
* applied multiple comments from Gavin regarding error checking
and callback placement

v2:
* moved "Account TCE pages in locked_vm" here (was in later series)
* added counting for huge window to locked_vm (ugly but better than nothing)
* fixed bug with missing >>PAGE_SHIFT when calling pfn_to_page

Alexey Kardashevskiy (16):
  rcu: Define notrace version of list_for_each_entry_rcu and
    list_entry_rcu
  KVM: PPC: Use RCU for arch.spapr_tce_tables
  mm: Add helpers for locked_vm
  KVM: PPC: Account TCE-containing pages in locked_vm
  powerpc/iommu: Fix comments with it_page_shift
  powerpc/powernv: Make invalidate() a callback
  powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  powerpc/powernv: Convert/move set_bypass() callback to
    take_ownership()
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/powernv: Release replaced TCE
  powerpc/pseries/lpar: Enable VFIO
  powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  vfio: powerpc/spapr: Reuse locked_vm accounting helpers
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Enable Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h            |  33 ++-
 arch/powerpc/include/asm/kvm_host.h         |   1 +
 arch/powerpc/include/asm/machdep.h          |  25 --
 arch/powerpc/include/asm/tce.h              |  38 +++
 arch/powerpc/kernel/iommu.c                 | 158 ++++++++-----
 arch/powerpc/kernel/vio.c                   |   5 +-
 arch/powerpc/kvm/book3s.c                   |   2 +-
 arch/powerpc/kvm/book3s_64_vio.c            |  43 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c         |   6 +-
 arch/powerpc/platforms/cell/iommu.c         |   9 +-
 arch/powerpc/platforms/pasemi/iommu.c       |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 239 ++++++++++++++++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   4 +-
 arch/powerpc/platforms/powernv/pci.c        |  86 ++++---
 arch/powerpc/platforms/powernv/pci.h        |  16 +-
 arch/powerpc/platforms/pseries/iommu.c      |  77 ++++--
 arch/powerpc/sysdev/dart_iommu.c            |  13 +-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 348 ++++++++++++++++++++++++----
 include/linux/mm.h                          |   3 +
 include/linux/rculist.h                     |  38 +++
 include/uapi/linux/vfio.h                   |  37 ++-
 mm/mlock.c                                  |  49 ++++
 22 files changed, 990 insertions(+), 248 deletions(-)

-- 
2.0.0


* [PATCH v4 01/16] rcu: Define notrace version of list_for_each_entry_rcu and list_entry_rcu
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This defines list_for_each_entry_rcu_notrace, which uses the new
list_entry_rcu_notrace, which in turn uses rcu_dereference_raw_notrace
instead of rcu_dereference_raw. This allows list_for_each_entry_rcu_notrace
to be used while the MMU is off (real mode).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/rculist.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8183b46..a155774 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -253,6 +253,25 @@ static inline void list_splice_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_entry_rcu_notrace - get the struct for this entry
+ * @ptr:        the &struct list_head pointer.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_struct within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ *
+ * This is the same as list_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_entry_rcu_notrace(ptr, type, member) \
+({ \
+	typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
+	container_of((typeof(ptr))rcu_dereference_raw_notrace(__ptr), \
+			type, member); \
+})
+
+/**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
  *
  * Implementing those functions following their counterparts list_empty() and
@@ -308,6 +327,25 @@ static inline void list_splice_init_rcu(struct list_head *list,
 		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_rcu_notrace - iterate over rcu list of given type
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently with
+ * the _rcu list-mutation primitives such as list_add_rcu()
+ * as long as the traversal is guarded by rcu_read_lock().
+ *
+ * This is the same as list_for_each_entry_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define list_for_each_entry_rcu_notrace(pos, head, member) \
+	for (pos = list_entry_rcu_notrace((head)->next, typeof(*pos), member); \
+		&pos->member != (head); \
+		pos = list_entry_rcu_notrace(pos->member.next, typeof(*pos), \
+				member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
-- 
2.0.0


* [PATCH v4 02/16] KVM: PPC: Use RCU for arch.spapr_tce_tables
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

At the moment spapr_tce_tables is not protected against races. This makes
use of the RCU variants of the list helpers. As some bits are executed in
real mode, this uses the just-introduced list_for_each_entry_rcu_notrace().

This converts release_spapr_tce_table() to an RCU-scheduled handler.
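
The conversion follows the classic RCU pattern: unlink under the lock,
then free from a grace-period callback. call_rcu_sched() is used rather
than call_rcu(), presumably because the real-mode readers run with
interrupts disabled, which is an RCU-sched read-side critical section.
A condensed sketch of the pattern used in the diff below:

	/* updater: unlink, then defer the free past a grace period */
	mutex_lock(&kvm->lock);
	list_del_rcu(&stt->list);
	call_rcu_sched(&stt->rcu, release_spapr_tce_table);
	mutex_unlock(&kvm->lock);

	/* callback: runs once all pre-existing readers have finished */
	static void release_spapr_tce_table(struct rcu_head *head)
	{
		struct kvmppc_spapr_tce_table *stt = container_of(head,
				struct kvmppc_spapr_tce_table, rcu);
		/* ... __free_page() the table pages, kvm_put_kvm(),
		 * kfree(stt) ...
		 */
	}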

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
* total rework
* kfree() for kvmppc_spapr_tce_table is moved to call_rcu_sched() callback
* used new list_for_each_entry_rcu_notrace
---
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s.c           |  2 +-
 arch/powerpc/kvm/book3s_64_vio.c    | 23 +++++++++++++----------
 arch/powerpc/kvm/book3s_64_vio_hv.c |  6 ++++--
 4 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bb66d8b..cd22c31 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	struct rcu_head rcu;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index c254c27..9e17d19 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -886,7 +886,7 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 {
 
 #ifdef CONFIG_PPC64
-	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
+	INIT_LIST_HEAD_RCU(&kvm->arch.spapr_tce_tables);
 	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 54cf9bc..5958f7d 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,19 +45,16 @@ static long kvmppc_stt_npages(unsigned long window_size)
 		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
-static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
+static void release_spapr_tce_table(struct rcu_head *head)
 {
-	struct kvm *kvm = stt->kvm;
+	struct kvmppc_spapr_tce_table *stt = container_of(head,
+			struct kvmppc_spapr_tce_table, rcu);
 	int i;
 
-	mutex_lock(&kvm->lock);
-	list_del(&stt->list);
 	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
 		__free_page(stt->pages[i]);
+	kvm_put_kvm(stt->kvm);
 	kfree(stt);
-	mutex_unlock(&kvm->lock);
-
-	kvm_put_kvm(kvm);
 }
 
 static int kvm_spapr_tce_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -87,8 +84,13 @@ static int kvm_spapr_tce_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
 {
 	struct kvmppc_spapr_tce_table *stt = filp->private_data;
+	struct kvm *kvm = stt->kvm;
+
+	mutex_lock(&kvm->lock);
+	list_del_rcu(&stt->list);
+	call_rcu_sched(&stt->rcu, release_spapr_tce_table);
+	mutex_unlock(&kvm->lock);
 
-	release_spapr_tce_table(stt);
 	return 0;
 }
 
@@ -106,7 +108,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	int i;
 
 	/* Check this LIOBN hasn't been previously allocated */
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
+			list) {
 		if (stt->liobn == args->liobn)
 			return -EBUSY;
 	}
@@ -131,7 +134,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
-	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
 
 	mutex_unlock(&kvm->lock);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 89e96b3..b1914d9 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -50,7 +50,8 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
 	/* 	    liobn, ioba, tce); */
 
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
+			list) {
 		if (stt->liobn == liobn) {
 			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
 			struct page *page;
@@ -82,7 +83,8 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
+			list) {
 		if (stt->liobn == liobn) {
 			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
 			struct page *page;
-- 
2.0.0


* [PATCH v4 03/16] mm: Add helpers for locked_vm
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This adds 2 helpers to change the locked_vm counter:
- try_increment_locked_vm - may fail if the new locked_vm value would
exceed the RLIMIT_MEMLOCK limit;
- decrement_locked_vm.

These will be used by drivers capable of locking memory on userspace
request. For example, VFIO can use them to check whether it can lock DMA
memory, and PPC KVM can use them to check whether it can lock memory for
TCE tables.
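
A minimal usage sketch for such a driver; pin_user_memory() is a
hypothetical stand-in for the driver-specific pinning step:

	static long example_lock_memory(unsigned long uaddr, long npages)
	{
		long ret = try_increment_locked_vm(npages);

		if (ret)
			return ret;	/* would exceed RLIMIT_MEMLOCK */

		ret = pin_user_memory(uaddr, npages);	/* hypothetical */
		if (ret)
			decrement_locked_vm(npages); /* roll back accounting */

		return ret;
	}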

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 include/linux/mm.h |  3 +++
 mm/mlock.c         | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..1cb219d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+extern long try_increment_locked_vm(long npages);
+extern void decrement_locked_vm(long npages);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb536..39e4b55 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
 	spin_unlock(&shmlock_user_lock);
 	free_uid(user);
 }
+
+/**
+ * try_increment_locked_vm() - checks if new locked_vm value is going to
+ * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
+ *
+ * @npages: the number of pages to add to locked_vm.
+ *
+ * Returns 0 if succeeded or negative value if failed.
+ */
+long try_increment_locked_vm(long npages)
+{
+	long ret = 0, locked, lock_limit;
+
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+				rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(try_increment_locked_vm);
+
+/**
+ * decrement_locked_vm() - decrements the current task's locked_vm counter.
+ *
+ * @npages: the number to decrement by.
+ */
+void decrement_locked_vm(long npages)
+{
+	if (!current || !current->mm)
+		return; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	if (npages > current->mm->locked_vm)
+		npages = current->mm->locked_vm;
+	current->mm->locked_vm -= npages;
+	up_write(&current->mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(decrement_locked_vm);
-- 
2.0.0


* [PATCH v4 04/16] KVM: PPC: Account TCE-containing pages in locked_vm
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

At the moment pages used for TCE tables (not the pages addressed by TCEs)
are not counted in the locked_vm counter, so a malicious userspace tool
can call ioctl(KVM_CREATE_SPAPR_TCE) up to RLIMIT_NOFILE times and lock
a lot of memory.

This adds counting for pages used for TCE tables.

This counts the number of pages required for a table plus pages for
the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

This does not change the amount of (de)allocated memory.
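
A worked example, assuming 4K kernel pages and 4K TCE pages: a 256MB DMA
window needs 256MB >> 12 = 65536 TCEs, i.e. 65536 * sizeof(u64) = 512KB
of table, or 128 pages; the descriptor (the struct plus 128 struct page
pointers, roughly 1KB) rounds up to one more page, so 129 pages in total
are charged against RLIMIT_MEMLOCK.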

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* fixed counting for kvmppc_spapr_tce_table (used to be +1 page)
* added 2 helpers to common MM code for later reuse from vfio-spapr
---
 arch/powerpc/kvm/book3s_64_vio.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 5958f7d..b32aeb1 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,16 +45,33 @@ static long kvmppc_stt_npages(unsigned long window_size)
 		     * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
 }
 
+static long kvmppc_account_memlimit(long npages, bool inc)
+{
+	long stt_pages = ALIGN(sizeof(struct kvmppc_spapr_tce_table) +
+		(abs(npages) * sizeof(struct page *)), PAGE_SIZE) / PAGE_SIZE;
+
+	npages += stt_pages;
+	if (inc)
+		return try_increment_locked_vm(npages);
+
+	decrement_locked_vm(npages);
+
+	return 0;
+}
+
 static void release_spapr_tce_table(struct rcu_head *head)
 {
 	struct kvmppc_spapr_tce_table *stt = container_of(head,
 			struct kvmppc_spapr_tce_table, rcu);
 	int i;
+	long npages = kvmppc_stt_npages(stt->window_size);
 
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+	for (i = 0; i < npages; i++)
 		__free_page(stt->pages[i]);
 	kvm_put_kvm(stt->kvm);
 	kfree(stt);
+
+	kvmppc_account_memlimit(npages, false);
 }
 
 static int kvm_spapr_tce_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -115,6 +132,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 	}
 
 	npages = kvmppc_stt_npages(args->window_size);
+	ret = kvmppc_account_memlimit(npages, true);
+	if (ret)
+		goto fail;
 
 	stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
 		      GFP_KERNEL);
-- 
2.0.0


* [PATCH v4 05/16] powerpc/iommu: Fix comments with it_page_shift
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

There are a couple of commented-out debug prints which still use
IOMMU_PAGE_SHIFT(), which is not defined for POWERPC anymore; replace
them with it_page_shift.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 88e3ec6..f84f799 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1037,7 +1037,7 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
-			__func__, hwaddr, entry << IOMMU_PAGE_SHIFT(tbl),
+			__func__, hwaddr, entry << tbl->it_page_shift,
 				hwaddr, ret); */
 
 	return ret;
@@ -1056,7 +1056,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 			direction != DMA_TO_DEVICE, &page);
 	if (unlikely(ret != 1)) {
 		/* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
-				tce, entry << IOMMU_PAGE_SHIFT(tbl), ret); */
+				tce, entry << tbl->it_page_shift, ret); */
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
-- 
2.0.0


* [PATCH v4 06/16] powerpc/powernv: Make invalidate() a callback
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via
container_of(tbl). Since we are going to add Dynamic DMA windows, which
means having 2 IOMMU tables per PE, this is not going to work.

This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback.

This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer
to the PE there. PNV's ppc_md.tce_build() call uses this to find the PE and
do the invalidation. This will be used later for Dynamic DMA windows too.

This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.
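
The core of the change is a thin wrapper struct, so that container_of()
recovers the PE from any embedded iommu_table. Condensed from the diff
below:

	struct pnv_iommu_table {
		struct iommu_table	table;
		struct pnv_ioda_pe	*pe;
	};

	/* in the invalidation path */
	struct pnv_iommu_table *ptbl = container_of(tbl,
			struct pnv_iommu_table, table);
	struct pnv_ioda_pe *pe = ptbl->pe;

	if (pe && pe->tce_invalidate)
		pe->tce_invalidate(pe, tbl, startp, endp, rm);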

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* changed commit log to explain why this change is needed
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++++++++++--------------------
 arch/powerpc/platforms/powernv/pci.c      | 31 +++++++++++++++++++++--------
 arch/powerpc/platforms/powernv/pci.h      | 13 +++++++++++-
 3 files changed, 47 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9f28e18..007497f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -462,7 +462,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -489,7 +489,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, &pe->tce32.table);
 	}
 	return 0;
 }
@@ -499,7 +499,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		set_iommu_table_base_and_group(&dev->dev, &pe->tce32_table);
+		set_iommu_table_base_and_group(&dev->dev, &pe->tce32.table);
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
 	}
@@ -584,19 +584,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-				 __be64 *startp, __be64 *endp, bool rm)
-{
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
-	struct pnv_phb *phb = pe->phb;
-
-	if (phb->type == PNV_PHB_IODA1)
-		pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-	else
-		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -654,9 +641,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce_invalidate = pnv_pci_ioda1_tce_invalidate;
 
 	/* OPAL variant of P7IOC SW invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -693,7 +682,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
 	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -734,10 +723,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -785,9 +774,11 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = &pe->tce32.table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
+	pe->tce32.pe = pe;
+	pe->tce_invalidate = pnv_pci_ioda2_tce_invalidate;
 
 	/* OPAL variant of PHB3 invalidated TCEs */
 	swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4dff552..74a2626 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -550,6 +550,27 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };
 
+static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
+	__be64 *endp, bool rm)
+{
+	struct pnv_iommu_table *ptbl = container_of(tbl,
+			struct pnv_iommu_table, table);
+	struct pnv_ioda_pe *pe = ptbl->pe;
+
+	/*
+	 * Some implementations won't cache invalid TCEs and thus may not
+	 * need that flush. We'll probably turn it_type into a bit mask
+	 * of flags if that becomes the case
+	 */
+	if (!(tbl->it_type & TCE_PCI_SWINV_FREE))
+		return;
+
+	if (!pe || !pe->tce_invalidate)
+		return;
+
+	pe->tce_invalidate(pe, tbl, startp, endp, rm);
+}
+
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 			 unsigned long uaddr, enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
@@ -570,12 +591,7 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		*(tcep++) = cpu_to_be64(proto_tce |
 				(rpn++ << tbl->it_page_shift));
 
-	/* Some implementations won't cache invalid TCEs and thus may not
-	 * need that flush. We'll probably turn it_type into a bit mask
-	 * of flags if that becomes the case
-	 */
-	if (tbl->it_type & TCE_PCI_SWINV_CREATE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 
 	return 0;
 }
@@ -599,8 +615,7 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
 	while (npages--)
 		*(tcep++) = cpu_to_be64(0);
 
-	if (tbl->it_type & TCE_PCI_SWINV_FREE)
-		pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1, rm);
+	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 }
 
 static void pnv_tce_free_vm(struct iommu_table *tbl, long index, long npages)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6f5ff69..32847a5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -22,6 +22,16 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS		(1 << 1)	/* PE has primary PCI bus	*/
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 
+struct pnv_ioda_pe;
+typedef void (*pnv_invalidate_fn)(struct pnv_ioda_pe *pe,
+		struct iommu_table *tbl,
+		__be64 *startp, __be64 *endp, bool rm);
+
+struct pnv_iommu_table {
+	struct iommu_table	table;
+	struct pnv_ioda_pe	*pe;
+};
+
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
 struct pnv_ioda_pe {
@@ -49,9 +59,10 @@ struct pnv_ioda_pe {
 	unsigned int		dma_weight;
 
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
+	pnv_invalidate_fn	tce_invalidate;
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
2.0.0


* [PATCH v4 07/16] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

Modern IBM POWERPC systems support multiple IOMMU tables per PE, so we
need a more reliable way (compared to container_of()) to get a PE pointer
from the iommu_table struct pointer used in IOMMU functions.

At the moment IOMMU group data points to an iommu_table struct. This
introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
the pnv_ioda_pe struct, for others it is still a pointer to
the iommu_table struct. The ops structs correspond to the type which
iommu_owner points to.

At the moment get_table() is the only callback; it returns the iommu_table
covering a given bus address.

As the IOMMU group data pointer now points to a variable type instead of
iommu_table, the VFIO SPAPR TCE driver is updated to use the new type.
This changes the tce_container struct to keep iommu_group instead of
iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via
iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
	- iommu_table;
	- pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2, but IODA2
will get its own pnv_ioda2_iommu_get_table soon, and then
pnv_ioda1_iommu_get_table will only be used for IODA1.
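
From a consumer's point of view (condensed from the VFIO changes below),
resolving a table now always goes through the group data:

	struct spapr_tce_iommu_group *data =
			iommu_group_get_iommudata(container->grp);
	struct iommu_table *tbl;

	/* the default (32-bit) window ... */
	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);

	/* ... or whichever window covers a given bus address */
	tbl = data->ops->get_table(data, param.iova);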

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   6 ++
 arch/powerpc/include/asm/tce.h              |  15 ++++
 arch/powerpc/kernel/iommu.c                 |  34 ++++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  39 +++++++++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c        |   2 +-
 arch/powerpc/platforms/pseries/iommu.c      |  10 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c         | 113 +++++++++++++++++++++-------
 8 files changed, 184 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+				 void *iommu_owner,
+				 struct spapr_tce_iommu_ops *ops,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+					void *iommu_owner,
+					struct spapr_tce_iommu_ops *ops,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..8bfe98f 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,20 @@
 #define TCE_PCI_READ		0x1		/* read from PCI allowed */
 #define TCE_VB_WRITE		0x1		/* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+#define TCE_DEFAULT_WINDOW	~(0ULL)
+
+struct spapr_tce_iommu_ops {
+	struct iommu_table *(*get_table)(
+			struct spapr_tce_iommu_group *data,
+			phys_addr_t addr);
+};
+
+struct spapr_tce_iommu_group {
+	void *iommu_owner;
+	struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index f84f799..e203314 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -877,24 +877,52 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
  */
 static void group_release(void *iommu_data)
 {
-	struct iommu_table *tbl = iommu_data;
-	tbl->it_group = NULL;
+	kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_get_default_table(
+		struct spapr_tce_iommu_group *data, phys_addr_t addr)
+{
+	struct iommu_table *tbl = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return tbl;
+
+	if ((addr >> tbl->it_page_shift) < tbl->it_size)
+		return tbl;
+
+	return NULL;
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+	.get_table = spapr_tce_get_default_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+		void *iommu_owner, struct spapr_tce_iommu_ops *ops,
 		int pci_domain_number, unsigned long pe_num)
 {
 	struct iommu_group *grp;
 	char *name;
+	struct spapr_tce_iommu_group *data;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return;
+
+	data->iommu_owner = iommu_owner ? iommu_owner : tbl;
+	data->ops = ops ? ops : &spapr_tce_default_ops;
 
 	grp = iommu_group_alloc();
 	if (IS_ERR(grp)) {
 		pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
 				PTR_ERR(grp));
+		kfree(data);
 		return;
 	}
+
 	tbl->it_group = grp;
-	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu_group_set_iommudata(grp, data, group_release);
 	name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			pci_domain_number, pe_num);
 	if (!name)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 007497f..495137b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/msi.h>
 #include <linux/memblock.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -584,6 +585,34 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 	}
 }
 
+static bool pnv_pci_ioda_check_addr(struct iommu_table *tbl, __u64 start_addr)
+{
+	unsigned long entry = start_addr >> tbl->it_page_shift;
+	unsigned long start = tbl->it_offset;
+	unsigned long end = start + tbl->it_size;
+
+	return (start <= entry) && (entry < end);
+}
+
+static struct iommu_table *pnv_ioda1_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		phys_addr_t addr)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return &pe->tce32.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce32.table, addr))
+		return &pe->tce32.table;
+
+	return NULL;
+}
+
+static struct spapr_tce_iommu_ops pnv_pci_ioda1_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				      struct pnv_ioda_pe *pe, unsigned int base,
 				      unsigned int segs)
@@ -663,7 +692,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
@@ -729,6 +759,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
+static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
+	.get_table = pnv_ioda1_iommu_get_table,
+};
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -794,7 +828,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
+			phb->hose->global_number, pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 94ce348..b79066d 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -89,6 +89,7 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
 		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
+				NULL, NULL,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
 	}
 
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 74a2626..cc54e3b 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -674,7 +674,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
 	iommu_init_table(tbl, hose->node);
-	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 33b552f..a047754 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -616,7 +616,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 	iommu_table_setparms(pci->phb, dn, tbl);
 	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-	iommu_register_group(tbl, pci_domain_nr(bus), 0);
+	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
 	pci->phb->dma_window_size = 0x80000000ul;
@@ -661,7 +661,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
 		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(bus), 0);
+		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
 }
@@ -688,7 +688,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
 		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
-		iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
 					       PCI_DN(dn)->iommu_table);
 		return;
@@ -1104,7 +1105,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
 		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
-		iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
+		iommu_register_group(tbl, NULL, NULL,
+				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
 	} else {
 		pr_debug("  found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a84788b..46c2bc8 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -43,7 +43,7 @@ static void tce_iommu_detach_group(void *iommu_data,
  */
 struct tce_container {
 	struct mutex lock;
-	struct iommu_table *tbl;
+	struct iommu_group *grp;
 	bool enabled;
 };
 
@@ -51,9 +51,14 @@ static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
 	unsigned long locked, lock_limit, npages;
-	struct iommu_table *tbl = container->tbl;
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	if (!container->tbl)
+	if (!container->grp)
+		return -ENXIO;
+
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return -ENXIO;
 
 	if (!current->mm)
@@ -80,6 +85,10 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
 	 */
+	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+	if (!tbl)
+		return -ENXIO;
+
 	down_write(&current->mm->mmap_sem);
 	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
@@ -89,7 +98,6 @@ static int tce_iommu_enable(struct tce_container *container)
 				rlimit(RLIMIT_MEMLOCK));
 		ret = -ENOMEM;
 	} else {
-
 		current->mm->locked_vm += npages;
 		container->enabled = true;
 	}
@@ -100,16 +108,27 @@ static int tce_iommu_enable(struct tce_container *container)
 
 static void tce_iommu_disable(struct tce_container *container)
 {
+	struct spapr_tce_iommu_group *data;
+	struct iommu_table *tbl;
+
 	if (!container->enabled)
 		return;
 
 	container->enabled = false;
 
-	if (!container->tbl || !current->mm)
+	if (!container->grp || !current->mm)
+		return;
+
+	data = iommu_group_get_iommudata(container->grp);
+	if (!data || !data->iommu_owner || !data->ops->get_table)
+		return;
+
+	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+	if (!tbl)
 		return;
 
 	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (container->tbl->it_size <<
+	current->mm->locked_vm -= (tbl->it_size <<
 			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
@@ -136,11 +155,11 @@ static void tce_iommu_release(void *iommu_data)
 {
 	struct tce_container *container = iommu_data;
 
-	WARN_ON(container->tbl && !container->tbl->it_group);
+	WARN_ON(container->grp);
 	tce_iommu_disable(container);
 
-	if (container->tbl && container->tbl->it_group)
-		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+	if (container->grp)
+		tce_iommu_detach_group(iommu_data, container->grp);
 
 	mutex_destroy(&container->lock);
 
@@ -160,8 +179,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
 		if (WARN_ON(!tbl))
 			return -ENXIO;
 
@@ -185,13 +213,16 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_MAP_DMA: {
 		struct vfio_iommu_type1_dma_map param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 		unsigned long tce, i;
 
-		if (!tbl)
+		if (WARN_ON(!container->grp))
 			return -ENXIO;
 
-		BUG_ON(!tbl->it_group);
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -216,6 +247,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
 			tce |= TCE_PCI_WRITE;
 
+		tbl = data->ops->get_table(data, param.iova);
+		if (!tbl)
+			return -ENXIO;
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_put_param_check(tbl, param.iova, tce);
 		if (ret)
 			return ret;
@@ -238,9 +274,14 @@ static long tce_iommu_ioctl(void *iommu_data,
 	}
 	case VFIO_IOMMU_UNMAP_DMA: {
 		struct vfio_iommu_type1_dma_unmap param;
-		struct iommu_table *tbl = container->tbl;
+		struct iommu_table *tbl;
+		struct spapr_tce_iommu_group *data;
 
-		if (WARN_ON(!tbl))
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
 			return -ENXIO;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
@@ -259,6 +300,12 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (param.size & ~IOMMU_PAGE_MASK_4K)
 			return -EINVAL;
 
+		tbl = data->ops->get_table(data, param.iova);
+		if (WARN_ON(!tbl))
+			return -ENXIO;
+
+		BUG_ON(!tbl->it_group);
+
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
 				param.size >> IOMMU_PAGE_SHIFT_4K);
 		if (ret)
@@ -293,16 +340,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 {
 	int ret;
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
 
 	/* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
 			iommu_group_id(iommu_group), iommu_group); */
-	if (container->tbl) {
+	if (container->grp) {
 		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
-				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(container->grp),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else if (container->enabled) {
@@ -310,9 +357,16 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
 	} else {
+		data = iommu_group_get_iommudata(iommu_group);
+		if (WARN_ON(!data || !data->iommu_owner || !data->ops))
+			return -ENXIO;
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+		BUG_ON(!tbl);
+
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
-			container->tbl = tbl;
+			container->grp = iommu_group;
 	}
 
 	mutex_unlock(&container->lock);
@@ -324,24 +378,31 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group)
 {
 	struct tce_container *container = iommu_data;
-	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+	struct iommu_table *tbl;
+	struct spapr_tce_iommu_group *data;
 
-	BUG_ON(!tbl);
 	mutex_lock(&container->lock);
-	if (tbl != container->tbl) {
+	if (iommu_group != container->grp) {
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
-				iommu_group_id(tbl->it_group));
+				iommu_group_id(container->grp));
 	} else {
 		if (container->enabled) {
 			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
+					iommu_group_id(container->grp));
 			tce_iommu_disable(container);
 		}
 
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
+		container->grp = NULL;
+
+		data = iommu_group_get_iommudata(iommu_group);
+		BUG_ON(!data || !data->iommu_owner || !data->ops);
+
+		tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
+		BUG_ON(!tbl);
+
 		iommu_release_ownership(tbl);
 	}
 	mutex_unlock(&container->lock);
-- 
2.0.0


* [PATCH v4 08/16] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on an IODA2 PHB. This is exposed to POWERPC
IOMMU code, which calls this callback when external IOMMU users such as
VFIO are about to take over a PHB.

Since set_bypass() is not really an iommu_table function but a PE
function, and we have an ops struct per IOMMU owner, let's move
set_bypass() to the spapr_tce_iommu_ops struct.

As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
has very little to do with PEs, this moves take_ownership() calls to
the VFIO SPAPR TCE driver.

This renames set_bypass() to take_ownership() as it does not necessarily
just enable bypassing; it can do something else or more, so let's give it
a generic name. The meaning of the bool parameter is inverted.
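
The resulting flow on the VFIO side, condensed from the diff below
(taking ownership disables bypass, releasing ownership restores it):

	/* attach: own the TCE table first, then ask the owner to
	 * disable bypass so the user cannot DMA around the table
	 */
	ret = iommu_take_ownership(tbl);
	if (!ret) {
		container->grp = iommu_group;
		tce_iommu_take_ownership_notify(data, true);
	}

	/* detach: release the table, then restore kernel-owned bypass */
	iommu_release_ownership(tbl);
	tce_iommu_take_ownership_notify(data, false);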

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |  1 -
 arch/powerpc/include/asm/tce.h            |  2 ++
 arch/powerpc/kernel/iommu.c               | 12 ------------
 arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++++++++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c       | 17 +++++++++++++++++
 5 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 84ee339..2b0b01d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,7 +77,6 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
-	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 8bfe98f..5ee4987 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
 	struct iommu_table *(*get_table)(
 			struct spapr_tce_iommu_group *data,
 			phys_addr_t addr);
+	void (*take_ownership)(struct spapr_tce_iommu_group *data,
+			bool enable);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e203314..06984d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
-	/*
-	 * Disable iommu bypass, otherwise the user can DMA to all of
-	 * our physical memory via the bypass window instead of just
-	 * the pages that has been explicitly mapped into the iommu
-	 */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, false);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
-
-	/* The kernel owns the device now, we can restore the iommu bypass */
-	if (tbl->set_bypass)
-		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 495137b..f828c57 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32.table);
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	/* TVE #1 is selected by PCI address bit 59 */
 	pe->tce_bypass_base = 1ull << 59;
 
-	/* Install set_bypass callback for VFIO */
-	pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
-
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
+	pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
+				     bool enable)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
 	.get_table = pnv_ioda1_iommu_get_table,
+	.take_ownership = pnv_ioda2_take_ownership,
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 46c2bc8..d9845af 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,14 @@ struct tce_container {
 	bool enabled;
 };
 
+
+static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
+		bool enable)
+{
+	if (data && data->ops && data->ops->take_ownership)
+		data->ops->take_ownership(data, enable);
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -367,6 +375,12 @@ static int tce_iommu_attach_group(void *iommu_data,
 		ret = iommu_take_ownership(tbl);
 		if (!ret)
 			container->grp = iommu_group;
+		/*
+		 * Disable iommu bypass, otherwise the user can DMA to all of
+		 * our physical memory via the bypass window instead of just
+		 * the pages that has been explicitly mapped into the iommu
+		 */
+		tce_iommu_take_ownership_notify(data, true);
 	}
 
 	mutex_unlock(&container->lock);
@@ -404,6 +418,9 @@ static void tce_iommu_detach_group(void *iommu_data,
 		BUG_ON(!tbl);
 
 		iommu_release_ownership(tbl);
+
+		/* Kernel owns the device now, we can restore bypass */
+		tce_iommu_take_ownership_notify(data, false);
 	}
 	mutex_unlock(&container->lock);
 }
-- 
2.0.0


* [PATCH v4 09/16] powerpc/iommu: Fix IOMMU ownership control functions
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears the TCE content if there is no page marked busy in
it_map. Clearing must be done outside of the table locks because
iommu_clear_tce(), called from iommu_clear_tces_and_put_pages(), takes
these locks itself.
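
The locking pattern added here (condensed from the diff below) takes the
large pool lock first and then every per-pool lock, so it_map cannot
change underneath while the ownership state flips:

	spin_lock_irqsave(&tbl->large_pool.lock, flags);
	for (i = 0; i < tbl->nr_pools; i++)
		spin_lock(&tbl->pools[i].lock);

	/* ... inspect and update tbl->it_map ... */

	for (i = 0; i < tbl->nr_pools; i++)
		spin_unlock(&tbl->pools[i].lock);
	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);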

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/kernel/iommu.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 06984d5..c94b11d 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1103,33 +1103,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0, bit0 = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+		bit0 = test_and_clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		if (bit0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
-	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-	return 0;
+	if (!ret)
+		iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+				tbl->it_size);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
+
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 10/16] powerpc: Move tce_xxx callbacks from ppc_md to iommu_table
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (8 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 09/16] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/
tce_flush callbacks from ppc_md to the new struct where they really
belong.

This adds an extra @ops parameter to iommu_init_table() to make sure
that we do not leave any IOMMU table without iommu_table_ops. @it_ops is
initialized in the very beginning as iommu_init_table() calls
iommu_table_clear() and the latter uses callbacks already.

This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" prefixes
for better readability.

This removes tce_xxx_rm handlers from ppc_md as well but does not add
them to iommu_table_ops, this will be done later if we decide to support
TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as
callbacks for pseries. This changes the "multi" callbacks to fall back
to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want
to walk through all IOMMU tables in the system and replace the "multi"
callbacks with single ones.
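
Condensed, the conversion each platform undergoes here is sketched
below; the my_* names are placeholders, the real instances are in the
hunks that follow:

static struct iommu_table_ops my_platform_ops = {
	.set	= my_tce_set,	/* was ppc_md.tce_build, mandatory */
	.clear	= my_tce_clear,	/* was ppc_md.tce_free, mandatory */
	.get	= my_tce_get,	/* was ppc_md.tce_get, optional */
	.flush	= my_tce_flush,	/* was ppc_md.tce_flush, optional */
};

/* @ops must not be NULL, iommu_init_table() BUG()s on a NULL @ops */
tbl = iommu_init_table(tbl, nid, &my_platform_ops);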

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            | 20 +++++++++++-
 arch/powerpc/include/asm/machdep.h          | 25 ---------------
 arch/powerpc/kernel/iommu.c                 | 50 ++++++++++++++++-------------
 arch/powerpc/kernel/vio.c                   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c         |  9 ++++--
 arch/powerpc/platforms/pasemi/iommu.c       |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c        | 24 ++++----------
 arch/powerpc/platforms/powernv/pci.h        |  1 +
 arch/powerpc/platforms/pseries/iommu.c      | 42 +++++++++++++-----------
 arch/powerpc/sysdev/dart_iommu.c            | 13 ++++----
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+	int (*set)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
+	void (*clear)(struct iommu_table *tbl,
+			long index, long npages);
+	unsigned long (*get)(struct iommu_table *tbl, long index);
+	void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-					    int nid);
+					    int nid,
+					    struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index f92b0b5..0a2ec04 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	int		(*tce_build)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	unsigned long	(*tce_get)(struct iommu_table *tbl,
-				    long index);
-	void		(*tce_flush)(struct iommu_table *tbl);
-
-	/* _rm versions are for real mode use only */
-	int		(*tce_build_rm)(struct iommu_table *tbl,
-				     long index,
-				     long npages,
-				     unsigned long uaddr,
-				     enum dma_data_direction direction,
-				     struct dma_attrs *attrs);
-	void		(*tce_free_rm)(struct iommu_table *tbl,
-				    long index,
-				    long npages);
-	void		(*tce_flush_rm)(struct iommu_table *tbl);
-
 	void __iomem *	(*ioremap)(phys_addr_t addr, unsigned long size,
 				   unsigned long flags, void *caller);
 	void		(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c94b11d..6a86788 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -322,11 +322,11 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	ret = entry << tbl->it_page_shift;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	build_fail = ppc_md.tce_build(tbl, entry, npages,
+	build_fail = tbl->it_ops->set(tbl, entry, npages,
 				      (unsigned long)page &
 				      IOMMU_PAGE_MASK(tbl), direction, attrs);
 
-	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	/* tbl->it_ops->set() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
 	 * DMA_ERROR_CODE. For all other errors the functionality is
 	 * not altered.
@@ -337,8 +337,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -408,7 +408,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	if (!iommu_free_check(tbl, dma_addr, npages))
 		return;
 
-	ppc_md.tce_free(tbl, entry, npages);
+	tbl->it_ops->clear(tbl, entry, npages);
 
 	spin_lock_irqsave(&(pool->lock), flags);
 	bitmap_clear(tbl->it_map, free_entry, npages);
@@ -424,8 +424,8 @@ static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 	 * not do an mb() here on purpose, it is not needed on any of
 	 * the current platforms.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
@@ -495,7 +495,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		build_fail = ppc_md.tce_build(tbl, entry, npages,
+		build_fail = tbl->it_ops->set(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
 					      direction, attrs);
 		if(unlikely(build_fail))
@@ -534,8 +534,8 @@ int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 	}
 
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	DBG("mapped %d elements:\n", outcount);
 
@@ -600,8 +600,8 @@ void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist,
 	 * do not do an mb() here, the affected platforms do not need it
 	 * when freeing.
 	 */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 }
 
 static void iommu_table_clear(struct iommu_table *tbl)
@@ -613,17 +613,17 @@ static void iommu_table_clear(struct iommu_table *tbl)
 	 */
 	if (!is_kdump_kernel() || is_fadump_active()) {
 		/* Clear the table in case firmware left allocations in it */
-		ppc_md.tce_free(tbl, tbl->it_offset, tbl->it_size);
+		tbl->it_ops->clear(tbl, tbl->it_offset, tbl->it_size);
 		return;
 	}
 
 #ifdef CONFIG_CRASH_DUMP
-	if (ppc_md.tce_get) {
+	if (tbl->it_ops->get) {
 		unsigned long index, tceval, tcecount = 0;
 
 		/* Reserve the existing mappings left by the first kernel. */
 		for (index = 0; index < tbl->it_size; index++) {
-			tceval = ppc_md.tce_get(tbl, index + tbl->it_offset);
+			tceval = tbl->it_ops->get(tbl, index + tbl->it_offset);
 			/*
 			 * Freed TCE entry contains 0x7fffffffffffffff on JS20
 			 */
@@ -649,7 +649,8 @@ static void iommu_table_clear(struct iommu_table *tbl)
  * Build a iommu_table structure.  This contains a bit map which
  * is used to manage allocation of the tce space.
  */
-struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
+struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
+		struct iommu_table_ops *ops)
 {
 	unsigned long sz;
 	static int welcomed = 0;
@@ -657,6 +658,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	unsigned int i;
 	struct iommu_pool *p;
 
+	BUG_ON(!ops);
+	tbl->it_ops = ops;
+
 	/* number of bytes needed for the bitmap */
 	sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
 
@@ -947,8 +951,8 @@ EXPORT_SYMBOL_GPL(iommu_tce_direction);
 void iommu_flush_tce(struct iommu_table *tbl)
 {
 	/* Flush/invalidate TLB caches if necessary */
-	if (ppc_md.tce_flush)
-		ppc_md.tce_flush(tbl);
+	if (tbl->it_ops->flush)
+		tbl->it_ops->flush(tbl);
 
 	/* Make sure updates are seen by hardware */
 	mb();
@@ -959,7 +963,7 @@ int iommu_tce_clear_param_check(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce_value,
 		unsigned long npages)
 {
-	/* ppc_md.tce_free() does not support any value but 0 */
+	/* tbl->it_ops->clear() does not support any value but 0 */
 	if (tce_value)
 		return -EINVAL;
 
@@ -1007,9 +1011,9 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-		ppc_md.tce_free(tbl, entry, 1);
+		tbl->it_ops->clear(tbl, entry, 1);
 	else
 		oldtce = 0;
 
@@ -1056,10 +1060,10 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 
 	spin_lock(&(pool->lock));
 
-	oldtce = ppc_md.tce_get(tbl, entry);
+	oldtce = tbl->it_ops->get(tbl, entry);
 	/* Add new entry if it is not busy */
 	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
 
 	spin_unlock(&(pool->lock));
 
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 904c661..66e4d74 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1196,7 +1196,10 @@ static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 	tbl->it_type = TCE_VB;
 	tbl->it_blocksize = 16;
 
-	return iommu_init_table(tbl, -1);
+	return iommu_init_table(tbl, -1,
+			firmware_has_feature(FW_FEATURE_LPAR) ?
+			&iommu_table_lpar_multi_ops :
+			&iommu_table_pseries_ops);
 }
 
 /**
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 2b90ff8..f6254f7 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -465,6 +465,11 @@ static inline u32 cell_iommu_get_ioid(struct device_node *np)
 	return *ioid;
 }
 
+static struct iommu_table_ops cell_iommu_ops = {
+	.set = tce_build_cell,
+	.clear = tce_free_cell
+};
+
 static struct iommu_window * __init
 cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 			unsigned long offset, unsigned long size,
@@ -492,7 +497,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
 		(offset >> window->table.it_page_shift) + pte_offset;
 	window->table.it_size = size >> window->table.it_page_shift;
 
-	iommu_init_table(&window->table, iommu->nid);
+	iommu_init_table(&window->table, iommu->nid, &cell_iommu_ops);
 
 	pr_debug("\tioid      %d\n", window->ioid);
 	pr_debug("\tblocksize %ld\n", window->table.it_blocksize);
@@ -1199,8 +1204,6 @@ static int __init cell_iommu_init(void)
 	/* Setup various ppc_md. callbacks */
 	ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
 	ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
-	ppc_md.tce_build = tce_build_cell;
-	ppc_md.tce_free = tce_free_cell;
 
 	if (!iommu_fixed_disabled && cell_iommu_fixed_mapping_init() == 0)
 		goto bail;
diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 2e576f2..eac33f4 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -132,6 +132,10 @@ static void iobmap_free(struct iommu_table *tbl, long index,
 	}
 }
 
+static struct iommu_table_ops iommu_table_iobmap_ops = {
+	.set = iobmap_build,
+	.clear  = iobmap_free
+};
 
 static void iommu_table_iobmap_setup(void)
 {
@@ -151,7 +155,7 @@ static void iommu_table_iobmap_setup(void)
 	 * Should probably be 8 (64 bytes)
 	 */
 	iommu_table_iobmap.it_blocksize = 4;
-	iommu_init_table(&iommu_table_iobmap, 0);
+	iommu_init_table(&iommu_table_iobmap, 0, &iommu_table_iobmap_ops);
 	pr_debug(" <- %s\n", __func__);
 }
 
@@ -250,8 +254,6 @@ void __init iommu_init_early_pasemi(void)
 
 	ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pasemi;
 	ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pasemi;
-	ppc_md.tce_build = iobmap_build;
-	ppc_md.tce_free  = iobmap_free;
 	set_pci_dma_ops(&dma_iommu_ops);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f828c57..7482518 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -691,7 +691,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_FREE   |
 				 TCE_PCI_SWINV_PAIR);
 	}
-	iommu_init_table(tbl, phb->hose->node);
+	iommu_init_table(tbl, phb->hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, pe, &pnv_pci_ioda1_ops,
 			phb->hose->global_number, pe->pe_number);
 
@@ -831,7 +831,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				8);
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
-	iommu_init_table(tbl, phb->hose->node);
+	iommu_init_table(tbl, phb->hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, pe, &pnv_pci_ioda2_ops,
 			phb->hose->global_number, pe->pe_number);
 
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index b79066d..ea97414 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -87,7 +87,8 @@ static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
 	if (phb->p5ioc2.iommu_table.it_map == NULL) {
-		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
+		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node,
+				&pnv_iommu_ops);
 		iommu_register_group(&phb->p5ioc2.iommu_table,
 				NULL, NULL,
 				pci_domain_nr(phb->hose->bus), phb->opal_id);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index cc54e3b..1179c63 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,18 +628,11 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
 }
 
-static int pnv_tce_build_rm(struct iommu_table *tbl, long index, long npages,
-			    unsigned long uaddr,
-			    enum dma_data_direction direction,
-			    struct dma_attrs *attrs)
-{
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs, true);
-}
-
-static void pnv_tce_free_rm(struct iommu_table *tbl, long index, long npages)
-{
-	pnv_tce_free(tbl, index, npages, true);
-}
+struct iommu_table_ops pnv_iommu_ops = {
+	.set = pnv_tce_build_vm,
+	.clear = pnv_tce_free_vm,
+	.get = pnv_tce_get,
+};
 
 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 			       void *tce_mem, u64 tce_size,
@@ -673,7 +666,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 		return NULL;
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0, IOMMU_PAGE_SHIFT_4K);
-	iommu_init_table(tbl, hose->node);
+	iommu_init_table(tbl, hose->node, &pnv_iommu_ops);
 	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
@@ -816,11 +809,6 @@ void __init pnv_pci_init(void)
 
 	/* Configure IOMMU DMA hooks */
 	ppc_md.pci_dma_dev_setup = pnv_pci_dma_dev_setup;
-	ppc_md.tce_build = pnv_tce_build_vm;
-	ppc_md.tce_free = pnv_tce_free_vm;
-	ppc_md.tce_build_rm = pnv_tce_build_rm;
-	ppc_md.tce_free_rm = pnv_tce_free_rm;
-	ppc_md.tce_get = pnv_tce_get;
 	ppc_md.pci_probe_mode = pnv_pci_probe_mode;
 	set_pci_dma_ops(&dma_iommu_ops);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 32847a5..ab85743 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -200,6 +200,7 @@ extern struct pci_ops pnv_pci_ops;
 #ifdef CONFIG_EEH
 extern struct pnv_eeh_ops ioda_eeh_ops;
 #endif
+extern struct iommu_table_ops pnv_iommu_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a047754..793f002 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -193,7 +193,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	int ret = 0;
 	unsigned long flags;
 
-	if (npages == 1) {
+	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 		                           direction, attrs);
 	}
@@ -285,6 +285,9 @@ static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long n
 {
 	u64 rc;
 
+	if (!firmware_has_feature(FW_FEATURE_MULTITCE))
+		return tce_free_pSeriesLP(tbl, tcenum, npages);
+
 	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
 
 	if (rc && printk_ratelimit()) {
@@ -460,7 +463,6 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
 	return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
-
 #ifdef CONFIG_PCI
 static void iommu_table_setparms(struct pci_controller *phb,
 				 struct device_node *dn,
@@ -546,6 +548,12 @@ static void iommu_table_setparms_lpar(struct pci_controller *phb,
 	tbl->it_size = size >> tbl->it_page_shift;
 }
 
+struct iommu_table_ops iommu_table_pseries_ops = {
+	.set = tce_build_pSeries,
+	.clear = tce_free_pSeries,
+	.get = tce_get_pseries
+};
+
 static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 {
 	struct device_node *dn;
@@ -615,7 +623,8 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 			   pci->phb->node);
 
 	iommu_table_setparms(pci->phb, dn, tbl);
-	pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+	pci->iommu_table = iommu_init_table(tbl, pci->phb->node,
+			&iommu_table_pseries_ops);
 	iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 
 	/* Divide the rest (1.75GB) among the children */
@@ -626,6 +635,11 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 	pr_debug("ISA/IDE, window size is 0x%llx\n", pci->phb->dma_window_size);
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+	.set = tce_buildmulti_pSeriesLP,
+	.clear = tce_freemulti_pSeriesLP,
+	.get = tce_get_pSeriesLP
+};
 
 static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 {
@@ -660,7 +674,8 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   ppci->phb->node);
 		iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
-		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
+		ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node,
+				&iommu_table_lpar_multi_ops);
 		iommu_register_group(tbl, NULL, NULL, pci_domain_nr(bus), 0);
 		pr_debug("  created table: %p\n", ppci->iommu_table);
 	}
@@ -687,7 +702,8 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   phb->node);
 		iommu_table_setparms(phb, dn, tbl);
-		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
+		PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node,
+				&iommu_table_pseries_ops);
 		iommu_register_group(tbl, NULL, NULL,
 				pci_domain_nr(phb->bus), 0);
 		set_iommu_table_base_and_group(&dev->dev,
@@ -1104,7 +1120,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 		tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL,
 				   pci->phb->node);
 		iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
-		pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+		pci->iommu_table = iommu_init_table(tbl, pci->phb->node,
+				&iommu_table_lpar_multi_ops);
 		iommu_register_group(tbl, NULL, NULL,
 				pci_domain_nr(pci->phb->bus), 0);
 		pr_debug("  created table: %p\n", pci->iommu_table);
@@ -1289,22 +1306,11 @@ void iommu_init_early_pSeries(void)
 		return;
 
 	if (firmware_has_feature(FW_FEATURE_LPAR)) {
-		if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
-			ppc_md.tce_build = tce_buildmulti_pSeriesLP;
-			ppc_md.tce_free	 = tce_freemulti_pSeriesLP;
-		} else {
-			ppc_md.tce_build = tce_build_pSeriesLP;
-			ppc_md.tce_free	 = tce_free_pSeriesLP;
-		}
-		ppc_md.tce_get   = tce_get_pSeriesLP;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
 		ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
 		ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
 	} else {
-		ppc_md.tce_build = tce_build_pSeries;
-		ppc_md.tce_free  = tce_free_pSeries;
-		ppc_md.tce_get   = tce_get_pseries;
 		ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeries;
 		ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeries;
 	}
@@ -1322,8 +1328,6 @@ static int __init disable_multitce(char *str)
 	    firmware_has_feature(FW_FEATURE_LPAR) &&
 	    firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
-		ppc_md.tce_build = tce_build_pSeriesLP;
-		ppc_md.tce_free	 = tce_free_pSeriesLP;
 		powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
 	}
 	return 1;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index 9e5353f..27721f5 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -286,6 +286,12 @@ static int __init dart_init(struct device_node *dart_node)
 	return 0;
 }
 
+static struct iommu_table_ops iommu_dart_ops = {
+	.set = dart_build,
+	.clear = dart_free,
+	.flush = dart_flush,
+};
+
 static void iommu_table_dart_setup(void)
 {
 	iommu_table_dart.it_busno = 0;
@@ -298,7 +304,7 @@ static void iommu_table_dart_setup(void)
 	iommu_table_dart.it_base = (unsigned long)dart_vbase;
 	iommu_table_dart.it_index = 0;
 	iommu_table_dart.it_blocksize = 1;
-	iommu_init_table(&iommu_table_dart, -1);
+	iommu_init_table(&iommu_table_dart, -1, &iommu_dart_ops);
 
 	/* Reserve the last page of the DART to avoid possible prefetch
 	 * past the DART mapped area
@@ -386,11 +392,6 @@ void __init iommu_init_early_dart(void)
 	if (dart_init(dn) != 0)
 		goto bail;
 
-	/* Setup low level TCE operations for the core IOMMU code */
-	ppc_md.tce_build = dart_build;
-	ppc_md.tce_free  = dart_free;
-	ppc_md.tce_flush = dart_flush;
-
 	/* Setup bypass if supported */
 	if (dart_is_u4)
 		ppc_md.dma_set_mask = dart_dma_set_mask;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 11/16] powerpc/powernv: Release replaced TCE
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (9 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 10/16] powerpc: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-08-06  6:25   ` Benjamin Herrenschmidt
                     ` (2 more replies)
  2014-07-30  9:31 ` [PATCH v4 12/16] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  15 siblings, 3 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

At the moment writing a new TCE value to the IOMMU table fails with
EBUSY if there is a valid entry already. However, the PAPR
specification allows the guest to write a new TCE value without
clearing the old one first.

This adds a set_and_get() callback to iommu_table_ops which does the
same thing as set() but also returns the replaced TCE(s) so the caller
can release the pages afterwards.

This makes iommu_tce_build() put the pages returned by set_and_get().

Since we now depend on the permission bits in TCE entries, this
preserves those bits in iommu_put_tce_user_mode().

This removes the use of pool locks as those locks protect TCE
allocation rather than IOMMU table access, and the new set_and_get()
callback provides a lockless way to release pages safely.

This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not
implement the set_and_get() callback. Therefore the "powernv" platform
is the only supported one.
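
The calling convention, as used by the reworked iommu_tce_build()
below, is roughly:

	unsigned long old = 0;

	ret = tbl->it_ops->set_and_get(tbl, entry, 1, hwaddr, &old,
			direction, NULL);
	/* A valid old entry still pins a page - release it */
	if (old & (TCE_PCI_READ | TCE_PCI_WRITE))
		put_page(pfn_to_page(__pa(old) >> PAGE_SHIFT));

On powernv the per-entry update is a single xchg(), so concurrent
callers can never fetch the same old TCE and put_page() it twice;
this is what makes dropping the pool locks safe.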

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* this is merge+rework of
	powerpc/powernv: Return non-zero TCE from pnv_tce_build
	powerpc/iommu: Implement put_page() if TCE had non-zero value
	powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values
---
 arch/powerpc/include/asm/iommu.h     |  6 ++++++
 arch/powerpc/kernel/iommu.c          | 28 +++++++++++++++-------------
 arch/powerpc/platforms/powernv/pci.c | 29 +++++++++++++++++++++++------
 3 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c725e4a..4b13e4e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
 			unsigned long uaddr,
 			enum dma_data_direction direction,
 			struct dma_attrs *attrs);
+	int (*set_and_get)(struct iommu_table *tbl,
+			long index, long npages,
+			unsigned long uaddr,
+			unsigned long *old_tces,
+			enum dma_data_direction direction,
+			struct dma_attrs *attrs);
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
 	unsigned long (*get)(struct iommu_table *tbl, long index);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 6a86788..ad52e00 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1007,9 +1007,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 {
 	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
-
-	spin_lock(&(pool->lock));
 
 	oldtce = tbl->it_ops->get(tbl, entry);
 	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
@@ -1017,8 +1014,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
 	else
 		oldtce = 0;
 
-	spin_unlock(&(pool->lock));
-
 	return oldtce;
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce);
@@ -1056,16 +1051,12 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 {
 	int ret = -EBUSY;
 	unsigned long oldtce;
-	struct iommu_pool *pool = get_pool(tbl, entry);
 
-	spin_lock(&(pool->lock));
+	ret = tbl->it_ops->set_and_get(tbl, entry, 1, hwaddr, &oldtce,
+			direction, NULL);
 
-	oldtce = tbl->it_ops->get(tbl, entry);
-	/* Add new entry if it is not busy */
-	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
-		ret = tbl->it_ops->set(tbl, entry, 1, hwaddr, direction, NULL);
-
-	spin_unlock(&(pool->lock));
+	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
+		put_page(pfn_to_page(__pa(oldtce) >> PAGE_SHIFT));
 
 	/* if (unlikely(ret))
 		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
@@ -1092,6 +1083,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 		return -EFAULT;
 	}
 	hwaddr = (unsigned long) page_address(page) + offset;
+	hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
 
 	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
 	if (ret)
@@ -1110,6 +1102,16 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0, bit0 = 0;
 
+	/*
+	 * VFIO does not control TCE entries allocation and the guest
+	 * can write new TCEs on top of existing ones so iommu_tce_build()
+	 * must be able to release old pages. This functionality
+	 * requires a defined set_and_get() callback, so if it is not
+	 * implemented, we disallow taking ownership over the table.
+	 */
+	if (!tbl->it_ops->set_and_get)
+		return -EINVAL;
+
 	spin_lock_irqsave(&tbl->large_pool.lock, flags);
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 1179c63..629d443 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,12 +572,14 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, __be64 *startp,
 }
 
 static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
-			 unsigned long uaddr, enum dma_data_direction direction,
+			 unsigned long uaddr, unsigned long *old_tces,
+			 enum dma_data_direction direction,
 			 struct dma_attrs *attrs, bool rm)
 {
 	u64 proto_tce;
 	__be64 *tcep, *tces;
 	u64 rpn;
+	long i;
 
 	proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -587,9 +589,13 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
 	rpn = __pa(uaddr) >> tbl->it_page_shift;
 
-	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce |
-				(rpn++ << tbl->it_page_shift));
+	for (i = 0; i < npages; i++) {
+		unsigned long oldtce = xchg(tcep, cpu_to_be64(proto_tce |
+				(rpn++ << tbl->it_page_shift)));
+		if (old_tces)
+			old_tces[i] = (unsigned long) __va(oldtce);
+		tcep++;
+	}
 
 	pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
 
@@ -601,8 +607,18 @@ static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
 			    enum dma_data_direction direction,
 			    struct dma_attrs *attrs)
 {
-	return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
-			false);
+	return pnv_tce_build(tbl, index, npages, uaddr, NULL, direction,
+			attrs, false);
+}
+
+static int pnv_tce_set_and_get_vm(struct iommu_table *tbl, long index,
+				  long npages,
+				  unsigned long uaddr, unsigned long *old_tces,
+				  enum dma_data_direction direction,
+				  struct dma_attrs *attrs)
+{
+	return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
+			attrs, false);
 }
 
 static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
@@ -630,6 +646,7 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 
 struct iommu_table_ops pnv_iommu_ops = {
 	.set = pnv_tce_build_vm,
+	.set_and_get = pnv_tce_set_and_get_vm,
 	.clear = pnv_tce_free_vm,
 	.get = pnv_tce_get,
 };
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 12/16] powerpc/pseries/lpar: Enable VFIO
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (10 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 13/16] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

The previous patch introduced the iommu_table_ops::set_and_get()
callback, which effectively disabled VFIO on pseries. This implements
set_and_get() for pseries/lpar so VFIO can work under pHyp again.

Since the set_and_get() callback must return the old TCE, it has to do
an H_GET_TCE hcall for every TCE being replaced, so VFIO's performance
under pHyp is expected to be slow.
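
In other words, each replaced TCE enters the hypervisor twice; the
loop body added below boils down to:

	if (old_tces)	/* only when the caller wants the old values */
		plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
				&old_tces[i++]);	/* H_GET_TCE */
	rc = plpar_tce_put((u64)tbl->it_index,
			(u64)tcenum << 12, tce);	/* H_PUT_TCE */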

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 793f002..d3cded1 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
+				unsigned long *old_tces,
 				enum dma_data_direction direction,
 				struct dma_attrs *attrs)
 {
 	u64 rc = 0;
 	u64 proto_tce, tce;
 	u64 rpn;
-	int ret = 0;
+	int ret = 0, i = 0;
 	long tcenum_start = tcenum, npages_start = npages;
 
 	rpn = __pa(uaddr) >> TCE_SHIFT;
@@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	while (npages--) {
 		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+		if (old_tces)
+			plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
+					&old_tces[i++]);
 		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
 		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
@@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
-static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_set_and_get_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
+				     unsigned long *old_tces,
 				     enum dma_data_direction direction,
 				     struct dma_attrs *attrs)
 {
@@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 
 	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					   old_tces,
 		                           direction, attrs);
 	}
 
@@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+					    old_tces,
 					    direction, attrs);
 		}
 		__get_cpu_var(tce_page) = tcep;
@@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		for (l = 0; l < limit; l++) {
 			tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
 			rpn++;
+			if (old_tces)
+				plpar_tce_get((u64)tbl->it_index,
+						(u64)(tcenum + l) << 12,
+						&old_tces[tcenum + l]);
 		}
 
 		rc = plpar_tce_put_indirect((u64)tbl->it_index,
@@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	return ret;
 }
 
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+				     long npages, unsigned long uaddr,
+				     enum dma_data_direction direction,
+				     struct dma_attrs *attrs)
+{
+	return tce_set_and_get_pSeriesLP(tbl, tcenum, npages, uaddr, NULL,
+			direction, attrs);
+}
+
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
 	u64 rc;
@@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
 	.set = tce_buildmulti_pSeriesLP,
+	.set_and_get = tce_set_and_get_pSeriesLP,
 	.clear = tce_freemulti_pSeriesLP,
 	.get = tce_get_pSeriesLP
 };
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 13/16] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (11 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 12/16] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 14/16] vfio: powerpc/spapr: Reuse locked_vm accounting helpers Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

SPAPR defines an interface to create additional DMA windows dynamically.
"Dynamically" means that the window is not allocated at guest start and
the guest can request it later. In practice, existing Linux guests check
for the capability and, if it is there, create and map one DMA window as
big as the entire guest RAM.

SPAPR defines 4 RTAS calls for this feature which userspace implements.
This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns the number and sizes of
windows which can be created (one, any page size);
2. create - ibm,create-pe-dma-window - creates a window;
3. remove - ibm,remove-pe-dma-window - removes a window; only the
additional window created by create() can be removed, the default 32bit
window cannot be removed as guests do not expect new windows to start
from zero;
4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
to the default state; for now it only removes the additional window if
it was created.

A later patch in this series adds the corresponding ioctls to the VFIO
SPAPR TCE driver to pass the RTAS calls from userspace to the IODA code.
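
A rough sketch of how a host-side handler (such as the VFIO ioctls
added later in this series) is expected to drive these callbacks;
error handling is omitted and the values are illustrative:

	struct iommu_table *tbl;
	__u32 windows, pgmask;

	data->ops->query(data, &windows, &pgmask);
	if (windows && (pgmask & DDW_PGSIZE_16M))
		/* page_shift 24 == 16MB pages, window_shift is log2 of
		 * the window size in bytes */
		data->ops->create(data, 24, window_shift, &tbl);
	/* ... the guest maps its RAM via the new window ... */
	data->ops->remove(data, tbl);	/* or data->ops->reset(data) */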

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/tce.h            |  21 ++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 158 +++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.h      |   2 +
 3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 5ee4987..583463b 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -60,6 +60,27 @@ struct spapr_tce_iommu_ops {
 			phys_addr_t addr);
 	void (*take_ownership)(struct spapr_tce_iommu_group *data,
 			bool enable);
+
+	/* Dynamic DMA window */
+	/* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K       0x01
+#define DDW_PGSIZE_64K      0x02
+#define DDW_PGSIZE_16M      0x04
+#define DDW_PGSIZE_32M      0x08
+#define DDW_PGSIZE_64M      0x10
+#define DDW_PGSIZE_128M     0x20
+#define DDW_PGSIZE_256M     0x40
+#define DDW_PGSIZE_16G      0x80
+	long (*query)(struct spapr_tce_iommu_group *data,
+			__u32 *windows_available,
+			__u32 *page_size_mask);
+	long (*create)(struct spapr_tce_iommu_group *data,
+			__u32 page_shift,
+			__u32 window_shift,
+			struct iommu_table **ptbl);
+	long (*remove)(struct spapr_tce_iommu_group *data,
+			struct iommu_table *tbl);
+	long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7482518..6a847b2 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -754,6 +754,24 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+		struct spapr_tce_iommu_group *data,
+		phys_addr_t addr)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (addr == TCE_DEFAULT_WINDOW)
+		return &pe->tce32.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce64.table, addr))
+		return &pe->tce64.table;
+
+	if (pnv_pci_ioda_check_addr(&pe->tce32.table, addr))
+		return &pe->tce32.table;
+
+	return NULL;
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 				     bool enable)
 {
@@ -762,9 +780,147 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 	pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+		__u32 *windows_available, __u32 *page_size_mask)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	if (pe->tce64_active) {
+		*page_size_mask = 0;
+		*windows_available = 0;
+	} else {
+		*page_size_mask =
+			DDW_PGSIZE_4K |
+			DDW_PGSIZE_64K |
+			DDW_PGSIZE_16M;
+		*windows_available = 1;
+	}
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+		__u32 page_shift, __u32 window_shift,
+		struct iommu_table **ptbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	struct page *tce_mem = NULL;
+	void *addr;
+	long ret;
+	unsigned long tce_table_size =
+			(1ULL << (window_shift - page_shift)) * 8;
+	unsigned order;
+	struct iommu_table *tbl64 = &pe->tce64.table;
+
+	if ((page_shift != 12) && (page_shift != 16) && (page_shift != 24))
+		return -EINVAL;
+
+	if ((1ULL << window_shift) > memory_hotplug_max())
+		return -EINVAL;
+
+	if (pe->tce64_active)
+		return -EBUSY;
+
+	tce_table_size = max(0x1000UL, tce_table_size);
+	order = get_order(tce_table_size);
+
+	pe_info(pe, "Setting up DDW at %llx..%llx ws=0x%x ps=0x%x table_size=0x%lx order=0x%x\n",
+			pe->tce_bypass_base,
+			pe->tce_bypass_base + (1ULL << window_shift) - 1,
+			window_shift, page_shift, tce_table_size, order);
+
+	tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, order);
+	if (!tce_mem) {
+		pe_err(pe, " Failed to allocate a DDW\n");
+		return -EFAULT;
+	}
+	addr = page_address(tce_mem);
+	memset(addr, 0, tce_table_size);
+
+	/* Configure HW */
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1, /* Window number */
+			1,
+			__pa(addr),
+			tce_table_size,
+			1 << page_shift);
+	if (ret) {
+		pe_err(pe, " Failed to configure 64-bit TCE table, err %ld\n",
+				ret);
+		return -EFAULT;
+	}
+
+	/* Setup linux iommu table */
+	pnv_pci_setup_iommu_table(tbl64, addr, tce_table_size,
+			pe->tce_bypass_base, page_shift);
+	pe->tce64.pe = pe;
+
+	/* Copy "invalidate" register address */
+	tbl64->it_index = pe->tce32.table.it_index;
+	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
+			TCE_PCI_SWINV_PAIR;
+	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
+	tbl64->it_ops = pe->tce32.table.it_ops;
+
+	*ptbl = tbl64;
+
+	pe->tce64_active = true;
+
+	return 0;
+}
+
+static long pnv_pci_ioda2_ddw_remove(struct spapr_tce_iommu_group *data,
+		struct iommu_table *tbl)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+	struct pnv_phb *phb = pe->phb;
+	long ret;
+
+	/* Only additional 64bit window removal is supported */
+	if ((tbl != &pe->tce64.table) || !pe->tce64_active)
+		return -EFAULT;
+
+	pe_info(pe, "Removing huge 64bit DMA window\n");
+
+	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+	pe->tce64_active = false;
+
+	ret = opal_pci_map_pe_dma_window(phb->opal_id,
+			pe->pe_number,
+			(pe->pe_number << 1) + 1,
+			0/* levels */, 0/* table address */,
+			0/* table size */, 0/* page size */);
+	if (ret)
+		pe_warn(pe, "Unmapping failed, ret = %ld\n", ret);
+
+	free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+	memset(&pe->tce64, 0, sizeof(pe->tce64));
+
+	return ret;
+}
+
+static long pnv_pci_ioda2_ddw_reset(struct spapr_tce_iommu_group *data)
+{
+	struct pnv_ioda_pe *pe = data->iommu_owner;
+
+	pe_info(pe, "Reset DMA windows\n");
+
+	if (!pe->tce64_active)
+		return 0;
+
+	return pnv_pci_ioda2_ddw_remove(data, &pe->tce64.table);
+}
+
 static struct spapr_tce_iommu_ops pnv_pci_ioda2_ops = {
-	.get_table = pnv_ioda1_iommu_get_table,
+	.get_table = pnv_ioda2_iommu_get_table,
 	.take_ownership = pnv_ioda2_take_ownership,
+	.query = pnv_pci_ioda2_ddw_query,
+	.create = pnv_pci_ioda2_ddw_create,
+	.remove = pnv_pci_ioda2_ddw_remove,
+	.reset = pnv_pci_ioda2_ddw_reset
 };
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index ab85743..f25f633 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -64,6 +64,8 @@ struct pnv_ioda_pe {
 	int			tce32_segcount;
 	struct pnv_iommu_table	tce32;
 	phys_addr_t		tce_inval_reg_phys;
+	bool			tce64_active;
+	struct pnv_iommu_table	tce64;
 
 	/* 64-bit TCE bypass region */
 	bool			tce_bypass_enabled;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 14/16] vfio: powerpc/spapr: Reuse locked_vm accounting helpers
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (12 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 13/16] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 15/16] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
  15 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

There are helpers to account locked pages in the locked_vm counter;
this reuses these helpers in the VFIO SPAPR IOMMU driver.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might need to be bigger than the entire guest RAM.
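
The resulting pattern is sketched below; try_increment_locked_vm()
and decrement_locked_vm() were added by an earlier patch in this
series:

	long npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;

	ret = try_increment_locked_vm(npages);
	if (ret)	/* RLIMIT_MEMLOCK would be exceeded */
		return ret;
	/* ... */
	decrement_locked_vm(npages);	/* on disable */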

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* added comment explaining how big the ulimit should be
* used try_increment_locked_vm/decrement_locked_vm
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d9845af..6ed0fc3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -58,7 +58,6 @@ static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data,
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
-	unsigned long locked, lock_limit, npages;
 	struct iommu_table *tbl;
 	struct spapr_tce_iommu_group *data;
 
@@ -92,24 +91,24 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 	 * that would effectively kill the guest at random points, much better
 	 * enforcing the limit based on the max that the guest can map.
+	 *
+	 * Unfortunately at the moment it counts whole tables, no matter how
+	 * much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+	 * much memory the guest has. I.e. for a 4GB guest and 4 IOMMU groups
+	 * this is that we cannot tell here the amount of RAM used by the guest
+	 * as this information is only available from KVM and VFIO is
+	 * KVM agnostic.
 	 */
 	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
 	if (!tbl)
 		return -ENXIO;
 
-	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	locked = current->mm->locked_vm + npages;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-				rlimit(RLIMIT_MEMLOCK));
-		ret = -ENOMEM;
-	} else {
-		current->mm->locked_vm += npages;
-		container->enabled = true;
-	}
-	up_write(&current->mm->mmap_sem);
+	ret = try_increment_locked_vm((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >>
+			PAGE_SHIFT);
+	if (ret)
+		return ret;
+
+	container->enabled = true;
 
 	return ret;
 }
@@ -135,10 +134,8 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!tbl)
 		return;
 
-	down_write(&current->mm->mmap_sem);
-	current->mm->locked_vm -= (tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-	up_write(&current->mm->mmap_sem);
+	decrement_locked_vm((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >>
+			PAGE_SHIFT);
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 15/16] vfio: powerpc/spapr: Use it_page_size
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (13 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 14/16] vfio: powerpc/spapr: Reuse locked_vm accounting helpers Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:31 ` [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
  15 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This makes use of it_page_size from the iommu_table struct
as the page size can differ from 4K.

This replaces the missing IOMMU_PAGE_SHIFT macro in commented-out debug
code as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
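
The arithmetic is the same in every hunk; as a worked example, for a
2GB window of 16MB IOMMU pages on a 64K-page kernel:

	/* it_size = 128 TCEs, it_page_shift = 24, PAGE_SHIFT = 16 */
	window_bytes = tbl->it_size << tbl->it_page_shift; /* 128 << 24 = 2GB */
	locked_pages = window_bytes >> PAGE_SHIFT;	/* 2GB >> 16 = 32768 */
	tce_step = IOMMU_PAGE_SIZE(tbl);		/* 1UL << 24 = 16MB */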

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 6ed0fc3..48b256c 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -103,7 +103,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	if (!tbl)
 		return -ENXIO;
 
-	ret = try_increment_locked_vm((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >>
+	ret = try_increment_locked_vm((tbl->it_size << tbl->it_page_shift) >>
 			PAGE_SHIFT);
 	if (ret)
 		return ret;
@@ -134,7 +134,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	if (!tbl)
 		return;
 
-	decrement_locked_vm((tbl->it_size << IOMMU_PAGE_SHIFT_4K) >>
+	decrement_locked_vm((tbl->it_size << tbl->it_page_shift) >>
 			PAGE_SHIFT);
 }
 
@@ -207,8 +207,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (info.argsz < minsz)
 			return -EINVAL;
 
-		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
@@ -261,17 +261,17 @@ static long tce_iommu_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
-		for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
+		for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) {
 			ret = iommu_put_tce_user_mode(tbl,
-					(param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
+					(param.iova >> tbl->it_page_shift) + i,
 					tce);
 			if (ret)
 				break;
-			tce += IOMMU_PAGE_SIZE_4K;
+			tce += IOMMU_PAGE_SIZE(tbl);
 		}
 		if (ret)
 			iommu_clear_tces_and_put_pages(tbl,
-					param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+					param.iova >> tbl->it_page_shift, i);
 
 		iommu_flush_tce(tbl);
 
@@ -312,13 +312,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 		BUG_ON(!tbl->it_group);
 
 		ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.size >> tbl->it_page_shift);
 		if (ret)
 			return ret;
 
 		ret = iommu_clear_tces_and_put_pages(tbl,
-				param.iova >> IOMMU_PAGE_SHIFT_4K,
-				param.size >> IOMMU_PAGE_SHIFT_4K);
+				param.iova >> tbl->it_page_shift,
+				param.size >> tbl->it_page_shift);
 		iommu_flush_tce(tbl);
 
 		return ret;
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows
  2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
                   ` (14 preceding siblings ...)
  2014-07-30  9:31 ` [PATCH v4 15/16] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
@ 2014-07-30  9:31 ` Alexey Kardashevskiy
  2014-07-30  9:36   ` Alexey Kardashevskiy
  15 siblings, 1 reply; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:31 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Michael Ellerman, Paul Mackerras, Gavin Shan

This defines and implements the VFIO IOMMU API required to support
Dynamic DMA windows as defined in the SPAPR specification. The ioctl
handlers implement the host-side part of the corresponding RTAS calls:
- VFIO_IOMMU_SPAPR_TCE_QUERY - ibm,query-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_CREATE - ibm,create-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_REMOVE - ibm,remove-pe-dma-window;
- VFIO_IOMMU_SPAPR_TCE_RESET - ibm,reset-pe-dma-window.

The VFIO IOMMU driver does basic sanity checks and calls the
corresponding SPAPR TCE functions. At the moment only IODA2 (the POWER8
PCI host bridge) implements them.

This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
VFIO_IOMMU_SPAPR_TCE_GET_INFO.

This calls reset() when the IOMMU is being disabled (which happens when
VFIO stops using it).
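
From userspace the expected sequence is roughly the sketch below
(container_fd is assumed to be an opened and attached VFIO container
using the SPAPR TCE IOMMU, the DDW_PGSIZE_* values are the ones from
asm/tce.h earlier in this series, and it needs <sys/ioctl.h> and the
uapi <linux/vfio.h> from this patch; error handling is omitted):

	struct vfio_iommu_spapr_tce_query query = {
		.argsz = sizeof(query),
	};
	struct vfio_iommu_spapr_tce_create create = {
		.argsz = sizeof(create),
		.page_shift = 24,	/* 16MB IOMMU pages */
		.window_shift = 32,	/* 4GB window */
	};

	ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
	if (query.windows_available &&
			(query.page_size_mask & DDW_PGSIZE_16M)) {
		ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
		/* create.start_addr is the bus address of the new window */
	}
	/* ... later: VFIO_IOMMU_SPAPR_TCE_REMOVE or _RESET ... */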

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   1 +
 drivers/vfio/vfio_iommu_spapr_tce.c       | 173 +++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h                 |  37 ++++++-
 3 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6a847b2..f51afe2 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -859,6 +859,7 @@ static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
 
 	/* Copy "invalidate" register address */
 	tbl64->it_index = pe->tce32.table.it_index;
+	tbl64->it_group = pe->tce32.table.it_group;
 	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
 			TCE_PCI_SWINV_PAIR;
 	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 48b256c..32e2804 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -45,6 +45,7 @@ struct tce_container {
 	struct mutex lock;
 	struct iommu_group *grp;
 	bool enabled;
+	unsigned long start64;
 };
 
 
@@ -123,19 +124,36 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	container->enabled = false;
 
-	if (!container->grp || !current->mm)
+	if (!container->grp)
 		return;
 
 	data = iommu_group_get_iommudata(container->grp);
 	if (!data || !data->iommu_owner || !data->ops->get_table)
 		return;
 
+	/* Try resetting, there might have been a 64bit window */
+	if (data->ops->reset)
+		data->ops->reset(data);
+
+	if (!current->mm)
+		return;
+
 	tbl = data->ops->get_table(data, TCE_DEFAULT_WINDOW);
 	if (!tbl)
 		return;
 
 	decrement_locked_vm((tbl->it_size << tbl->it_page_shift) >>
 			PAGE_SHIFT);
+
+	if (!container->start64)
+		return;
+
+	tbl = data->ops->get_table(data, container->start64);
+	if (!tbl)
+		return;
+
+	decrement_locked_vm((tbl->it_size << tbl->it_page_shift) >>
+			PAGE_SHIFT);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -210,6 +228,8 @@ static long tce_iommu_ioctl(void *iommu_data,
 		info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
 		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
 		info.flags = 0;
+		if (data->ops->query && data->ops->create && data->ops->remove)
+			info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
 
 		if (copy_to_user((void __user *)arg, &info, minsz))
 			return -EFAULT;
@@ -335,6 +355,157 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
+
+	case VFIO_IOMMU_SPAPR_TCE_QUERY: {
+		struct vfio_iommu_spapr_tce_query query;
+		struct spapr_tce_iommu_group *data;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_query,
+				page_size_mask);
+
+		if (copy_from_user(&query, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (query.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->query || !data->iommu_owner)
+			return -ENOSYS;
+
+		ret = data->ops->query(data,
+				&query.windows_available,
+				&query.page_size_mask);
+
+		if (ret)
+			return ret;
+
+		if (copy_to_user((void __user *)arg, &query, minsz))
+			return -EFAULT;
+
+		return 0;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+		struct vfio_iommu_spapr_tce_create create;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+				start_addr);
+
+		if (copy_from_user(&create, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (create.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->create || !data->iommu_owner)
+			return -ENOSYS;
+
+		BUG_ON(!data || !data->ops || !data->ops->remove);
+
+		ret = data->ops->create(data, create.page_shift,
+				create.window_shift, &tbl);
+		if (ret)
+			return ret;
+
+		ret = try_increment_locked_vm((tbl->it_size <<
+					tbl->it_page_shift) >> PAGE_SHIFT);
+		if (ret) {
+			data->ops->remove(data, tbl);
+			return ret;
+		}
+
+		create.start_addr = tbl->it_offset << tbl->it_page_shift;
+
+		if (copy_to_user((void __user *)arg, &create, minsz)) {
+			data->ops->remove(data, tbl);
+			decrement_locked_vm((tbl->it_size <<
+					tbl->it_page_shift) >> PAGE_SHIFT);
+			return -EFAULT;
+		}
+
+		return ret;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_REMOVE: {
+		struct vfio_iommu_spapr_tce_remove remove;
+		struct spapr_tce_iommu_group *data;
+		struct iommu_table *tbl;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_remove,
+				start_addr);
+
+		if (copy_from_user(&remove, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (remove.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->remove || !data->iommu_owner)
+			return -ENOSYS;
+
+		tbl = data->ops->get_table(data, remove.start_addr);
+		if (!tbl)
+			return -EINVAL;
+
+		ret = data->ops->remove(data, tbl);
+		if (ret)
+			return ret;
+
+		decrement_locked_vm((tbl->it_size << tbl->it_page_shift)
+				>> PAGE_SHIFT);
+		return 0;
+	}
+	case VFIO_IOMMU_SPAPR_TCE_RESET: {
+		struct vfio_iommu_spapr_tce_reset reset;
+		struct spapr_tce_iommu_group *data;
+
+		if (WARN_ON(!container->grp))
+			return -ENXIO;
+
+		data = iommu_group_get_iommudata(container->grp);
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_reset, argsz);
+
+		if (copy_from_user(&reset, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (reset.argsz < minsz)
+			return -EINVAL;
+
+		if (!data->ops->reset || !data->iommu_owner)
+			return -ENOSYS;
+
+		ret = data->ops->reset(data);
+		if (ret)
+			return ret;
+
+		if (container->start64) {
+			struct iommu_table *tbl;
+
+			tbl = data->ops->get_table(data, container->start64);
+			BUG_ON(!tbl);
+
+			decrement_locked_vm((tbl->it_size << tbl->it_page_shift)
+					>> PAGE_SHIFT);
+		}
+
+		return 0;
+	}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..8b03381 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -448,13 +448,48 @@ struct vfio_iommu_type1_dma_unmap {
  */
 struct vfio_iommu_spapr_tce_info {
 	__u32 argsz;
-	__u32 flags;			/* reserved for future use */
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_TCE_FLAG_DDW	1 /* Support dynamic windows */
 	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
 	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
 };
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * Dynamic DMA windows
+ */
+struct vfio_iommu_spapr_tce_query {
+	__u32 argsz;
+	/* out */
+	__u32 windows_available;
+	__u32 page_size_mask;
+};
+#define VFIO_IOMMU_SPAPR_TCE_QUERY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	/* in */
+	__u32 page_shift;
+	__u32 window_shift;
+	/* out */
+	__u64 start_addr;
+
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+struct vfio_iommu_spapr_tce_reset {
+	__u32 argsz;
+};
+#define VFIO_IOMMU_SPAPR_TCE_RESET	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows
  2014-07-30  9:31 ` [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
@ 2014-07-30  9:36   ` Alexey Kardashevskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30  9:36 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Ellerman, Paul Mackerras, Gavin Shan

On 07/30/2014 07:31 PM, Alexey Kardashevskiy wrote:
> [...]
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6a847b2..f51afe2 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -859,6 +859,7 @@ static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
>  
>  	/* Copy "invalidate" register address */
>  	tbl64->it_index = pe->tce32.table.it_index;
> +	tbl64->it_group = pe->tce32.table.it_group;

Just noticed: this does not belong here and must be moved to an earlier patch.


>  	tbl64->it_type = TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE |
>  			TCE_PCI_SWINV_PAIR;
>  	tbl64->it_map = (void *) 0xDEADBEEF; /* poison */
> [...]


-- 
Alexey

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 11/16] powerpc/powernv: Release replaced TCE
  2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
@ 2014-08-06  6:25   ` Benjamin Herrenschmidt
  2014-08-06  6:27   ` Benjamin Herrenschmidt
  2014-08-06  6:27   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-06  6:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, Gavin Shan

On Wed, 2014-07-30 at 19:31 +1000, Alexey Kardashevskiy wrote:
> +       if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
> +               put_page(pfn_to_page(__pa(oldtce) >> PAGE_SHIFT));

That probably needs a set_page_dirty() call if TCE_PCI_WRITE is set.
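
A minimal sketch of that (names follow the quoted snippet; whether
set_page_dirty() or SetPageDirty() is the right call here is left open,
and the placement is an assumption):

	if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
		struct page *pg = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);

		/* The device may have DMAed into the page, mark it dirty */
		if (oldtce & TCE_PCI_WRITE)
			SetPageDirty(pg);
		put_page(pg);
	}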

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 11/16] powerpc/powernv: Release replaced TCE
  2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
  2014-08-06  6:25   ` Benjamin Herrenschmidt
@ 2014-08-06  6:27   ` Benjamin Herrenschmidt
  2014-08-06  6:27   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-06  6:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, Gavin Shan

On Wed, 2014-07-30 at 19:31 +1000, Alexey Kardashevskiy wrote:
>  
>  static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> -                        unsigned long uaddr, enum dma_data_direction direction,
> +                        unsigned long uaddr, unsigned long *old_tces,
> +                        enum dma_data_direction direction,
>                          struct dma_attrs *attrs, bool rm)
>  {
>         u64 proto_tce;
>         __be64 *tcep, *tces;
>         u64 rpn;
> +       long i;
>  
>         proto_tce = TCE_PCI_READ; // Read allowed
>  
> @@ -587,9 +589,13 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>         tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
>         rpn = __pa(uaddr) >> tbl->it_page_shift;
>  
> -       while (npages--)
> -               *(tcep++) = cpu_to_be64(proto_tce |
> -                               (rpn++ << tbl->it_page_shift));
> +       for (i = 0; i < npages; i++) {
> +               unsigned long oldtce = xchg(tcep, cpu_to_be64(proto_tce |
> +                               (rpn++ << tbl->it_page_shift)));
> +               if (old_tces)
> +                       old_tces[i] = (unsigned long) __va(oldtce);
> +               tcep++;
> +       }

xchg() is slow; please keep separate implementations for build and
set_and_get() to avoid the performance loss on normal host TCE
operations.
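
A sketch of that split, keeping the plain store loop on the set() path
and paying for xchg() only on the set_and_get() path (names as in the
quoted patch):

	/* set(): hot path, plain stores, no atomics */
	while (npages--)
		*(tcep++) = cpu_to_be64(proto_tce |
				(rpn++ << tbl->it_page_shift));

	/* set_and_get(): only this path wants the old TCE back */
	for (i = 0; i < npages; i++) {
		unsigned long oldtce = xchg(tcep, cpu_to_be64(proto_tce |
				(rpn++ << tbl->it_page_shift)));
		old_tces[i] = (unsigned long) __va(oldtce);
		tcep++;
	}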

Cheers,
Ben.

>         pnv_tce_invalidate(tbl, tces, tcep - 1, rm);
>  
> @@ -601,8 +607,18 @@ static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
>                             enum dma_data_direction direction,
>                             struct dma_attrs *attrs)
>  {
> -       return pnv_tce_build(tbl, index, npages, uaddr, direction, attrs,
> -                       false);
> +       return pnv_tce_build(tbl, index, npages, uaddr, NULL, direction,
> +                       attrs, false);
> +}
> +
> +static int pnv_tce_set_and_get_vm(struct iommu_table *tbl, long index,
> +                                 long npages,
> +                                 unsigned long uaddr, unsigned long *old_tces,
> +                                 enum dma_data_direction direction,
> +                                 struct dma_attrs *attrs)
> +{
> +       return pnv_tce_build(tbl, index, npages, uaddr, old_tces, direction,
> +                       attrs, false);
>  }
>  
>  static void pnv_tce_free(struct iommu_table *tbl, long index, long npages,
> @@ -630,6 +646,7 @@ static unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
>  
>  struct iommu_table_ops pnv_iommu_ops = {
>         .set = pnv_tce_build_vm,
> +       .set_and_get = pnv_tce_set_and_get_vm,
>         .clear = pnv_tce_free_vm,
>         .get = pnv_tce_get,

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 11/16] powerpc/powernv: Release replaced TCE
  2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
  2014-08-06  6:25   ` Benjamin Herrenschmidt
  2014-08-06  6:27   ` Benjamin Herrenschmidt
@ 2014-08-06  6:27   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-06  6:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, Gavin Shan

On Wed, 2014-07-30 at 19:31 +1000, Alexey Kardashevskiy wrote:
> 
> This adds a set_and_get() callback to iommu_table_ops which does the
> same thing as set() plus it returns replaced TCE(s) so the caller can
> release the pages afterwards.

Call it xchg() instead of set_and_get(), it better reflects the
atomicity requirement.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 02/16] KVM: PPC: Use RCU for arch.spapr_tce_tables
  2014-07-30  9:31 ` [PATCH v4 02/16] KVM: PPC: Use RCU for arch.spapr_tce_tables Alexey Kardashevskiy
@ 2014-08-21  5:25   ` Paul Mackerras
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Mackerras @ 2014-08-21  5:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Michael Ellerman, linuxppc-dev, Gavin Shan

On Wed, Jul 30, 2014 at 07:31:21PM +1000, Alexey Kardashevskiy wrote:
> At the moment spapr_tce_tables is not protected against races. This makes
> use of RCU-variants of list helpers. As some bits are executed in real
> mode, this makes use of just introduced list_for_each_entry_rcu_notrace().
> 
> This converts release_spapr_tce_table() to a RCU scheduled handler.

...

> @@ -106,7 +108,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	int i;
>  
>  	/* Check this LIOBN hasn't been previously allocated */
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> +	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
> +			list) {
>  		if (stt->liobn == args->liobn)
>  			return -EBUSY;
>  	}

You need something to protect this traversal of the list.  In fact...

> @@ -131,7 +134,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	kvm_get_kvm(kvm);
>  
>  	mutex_lock(&kvm->lock);
> -	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> +	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
>  
>  	mutex_unlock(&kvm->lock);
>  

since it is the kvm->lock mutex that protects the list against
changes, you should do the check for whether the liobn is already in
the list inside the same mutex_lock ... unlock pair that protects the
list_add_rcu.

Or, if there is some other reason why two threads can't race and both
add the same liobn here, you need to explain that (and in that case
it's presumably unnecessary to hold kvm->lock).
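
A sketch of that reordering (allocation and kvm_get_kvm() elided; with
kvm->lock held the plain list iterator is fine since writers are
serialized):

	mutex_lock(&kvm->lock);
	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
		if (stt->liobn == args->liobn) {
			mutex_unlock(&kvm->lock);
			return -EBUSY;
		}
	}
	list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
	mutex_unlock(&kvm->lock);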

> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 89e96b3..b1914d9 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -50,7 +50,8 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
>  
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> +	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
> +			list) {
>  		if (stt->liobn == liobn) {
>  			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>  			struct page *page;
> @@ -82,7 +83,8 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvmppc_spapr_tce_table *stt;
>  
> -	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> +	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
> +			list) {
>  		if (stt->liobn == liobn) {
>  			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
>  			struct page *page;

What protects these list iterations?  Do we know that interrupts are
always disabled here, or that the caller has done an rcu_read_lock or
something?  It seems a little odd that you haven't added any calls to
rcu_read_lock/unlock().
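
As a sketch, the explicit read-side marking would look like this
(whether rcu_read_lock() is usable on the real-mode path is exactly the
open question):

	rcu_read_lock();
	list_for_each_entry_rcu_notrace(stt, &kvm->arch.spapr_tce_tables,
			list) {
		if (stt->liobn == liobn) {
			/* ... use stt while still in the read section ... */
			break;
		}
	}
	rcu_read_unlock();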

Paul.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-08-21  7:00 UTC | newest]

Thread overview: 22+ messages
2014-07-30  9:31 [PATCH v4 00/16] powernv: vfio: Add Dynamic DMA windows (DDW) Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 01/16] rcu: Define notrace version of list_for_each_entry_rcu and list_entry_rcu Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 02/16] KVM: PPC: Use RCU for arch.spapr_tce_tables Alexey Kardashevskiy
2014-08-21  5:25   ` Paul Mackerras
2014-07-30  9:31 ` [PATCH v4 03/16] mm: Add helpers for locked_vm Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 04/16] KVM: PPC: Account TCE-containing pages in locked_vm Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 05/16] powerpc/iommu: Fix comments with it_page_shift Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 06/16] powerpc/powernv: Make invalidate() a callback Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 07/16] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 08/16] powerpc/powernv: Convert/move set_bypass() callback to take_ownership() Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 09/16] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 10/16] powerpc: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 11/16] powerpc/powernv: Release replaced TCE Alexey Kardashevskiy
2014-08-06  6:25   ` Benjamin Herrenschmidt
2014-08-06  6:27   ` Benjamin Herrenschmidt
2014-08-06  6:27   ` Benjamin Herrenschmidt
2014-07-30  9:31 ` [PATCH v4 12/16] powerpc/pseries/lpar: Enable VFIO Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 13/16] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 14/16] vfio: powerpc/spapr: Reuse locked_vm accounting helpers Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 15/16] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
2014-07-30  9:31 ` [PATCH v4 16/16] vfio: powerpc/spapr: Enable Dynamic DMA windows Alexey Kardashevskiy
2014-07-30  9:36   ` Alexey Kardashevskiy
