* [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown
@ 2016-10-24  6:53 Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers Alexey Kardashevskiy
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-24  6:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Nicholas Piggin, Paul Mackerras

These patches fix a bug where pages stay pinned for hours
after the QEMU process which requested the pinning has exited.

Please comment. Thanks.

Alexey Kardashevskiy (4):
  powerpc/iommu: Pass mm_struct to init/cleanup helpers
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  vfio/spapr: Reference mm in tce_container
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

 arch/powerpc/include/asm/mmu_context.h |  20 ++--
 arch/powerpc/kernel/setup-common.c     |   2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c    |  60 ++++-------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 181 ++++++++++++++++++++++-----------
 5 files changed, 154 insertions(+), 115 deletions(-)

-- 
2.5.0.rc3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers
  2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
@ 2016-10-24  6:53 ` Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-24  6:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Nicholas Piggin, Paul Mackerras

We are going to get rid of @current references in mmu_context_book3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct, which does have
a reference counter.
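
For background, mm_struct carries two counters: mm_users for live
address-space users and mm_count for lasting references that may
outlive the process; the latter is what this series relies on. A
minimal sketch of the pattern, as patch 3 of this series applies it
in tce_iommu_open()/tce_iommu_release():

	/* take a lasting reference to the mm; it survives process exit
	 * (later kernels spell this mmgrab(mm)) */
	struct mm_struct *mm = current->mm;
	atomic_inc(&mm->mm_count);

	/* ... mm->context.iommu_group_mem_list remains valid here ... */

	mmdrop(mm);	/* the last mmdrop() frees the mm_struct */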

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 arch/powerpc/kernel/setup-common.c     | 2 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
 arch/powerpc/mm/mmu_context_iommu.c    | 9 +++++----
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 5c45114..424844b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 		unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 270ee30..f516ac5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&init_mm.context);
+	mm_iommu_init(&init_mm);
 #endif
 	irqstack_early_init();
 	exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&mm->context);
+	mm_iommu_init(mm);
 #endif
 	return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(&mm->context);
+	mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e0f1c33..ad2e575 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
 	struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+			next) {
 		list_del_rcu(&mem->next);
 		mm_iommu_do_free(mem);
 	}
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH kernel v4 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx
  2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers Alexey Kardashevskiy
@ 2016-10-24  6:53 ` Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-24  6:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Nicholas Piggin, Paul Mackerras

This changes the mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current, which in some situations may
not have a valid reference to an mm.

This changes the helpers to receive @mm and moves all references to
@current to the callers, including the checks for !current and
!current->mm; the checks in mm_iommu_preregistered() are removed as
it has no caller yet.

This moves the mm_iommu_adjust_locked_vm() call to its caller, as the
function it was in receives only an mm_iommu_table_group_mem_t while
the call needs the mm.

This should cause no behavioral change.
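
Distilled, a caller after this patch looks like the sketch below (the
function name is illustrative; the real callers are in
vfio_iommu_spapr_tce.c in the diff that follows):

	/* sketch of a caller after this patch (VFIO ioctl path) */
	static long unregister_pages_sketch(__u64 vaddr, __u64 size)
	{
		struct mm_iommu_table_group_mem_t *mem;

		/* the @current checks now live here, not in the helpers */
		if (!current || !current->mm)
			return -ESRCH;	/* process exited */

		mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
		if (!mem)
			return -ENOENT;

		return mm_iommu_put(current->mm, mem);
	}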

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/mmu_context.h | 16 ++++++------
 arch/powerpc/mm/mmu_context_iommu.c    | 46 +++++++++++++---------------------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 14 ++++++++---
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 424844b..b9e3f0a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
 struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);	/* from internal.h */
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index ad2e575..4c6db09 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	}
 
 	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-			current->pid,
+			current ? current->pid : 0,
 			incr ? '+' : '-',
 			npages << PAGE_SHIFT,
 			mm->locked_vm << PAGE_SHIFT,
@@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-	if (!current || !current->mm)
-		return false;
-
-	return !list_empty(&current->mm->context.iommu_group_mem_list);
+	return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 	return 0;
 }
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
 	struct page *page = NULL;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	mutex_lock(&mem_list_mutex);
 
-	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			++mem->used;
@@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
 	if (ret)
 		goto unlock_exit;
 
@@ -215,11 +209,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	mem->entries = entries;
 	*pmem = mem;
 
-	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
 	if (locked_entries && ret)
-		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
+		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
 
 	mutex_unlock(&mem_list_mutex);
 
@@ -264,17 +258,13 @@ static void mm_iommu_free(struct rcu_head *head)
 static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 {
 	list_del_rcu(&mem->next);
-	mm_iommu_adjust_locked_vm(current->mm, mem->entries, false);
 	call_rcu(&mem->rcu, mm_iommu_free);
 }
 
-long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
+long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	mutex_lock(&mem_list_mutex);
 
 	if (mem->used == 0) {
@@ -297,6 +287,8 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
 	/* @mapped became 0 so now mappings are disabled, release the region */
 	mm_iommu_release(mem);
 
+	mm_iommu_adjust_locked_vm(mm, mem->entries, false);
+
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
 
@@ -304,14 +296,12 @@ long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_put);
 
-struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size)
+struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size)
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
-	list_for_each_entry_rcu(mem,
-			&current->mm->context.iommu_group_mem_list,
-			next) {
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua <= ua) &&
 				(ua + size <= mem->ua +
 				 (mem->entries << PAGE_SHIFT))) {
@@ -324,14 +314,12 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 }
 EXPORT_SYMBOL_GPL(mm_iommu_lookup);
 
-struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries)
+struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries)
 {
 	struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
-	list_for_each_entry_rcu(mem,
-			&current->mm->context.iommu_group_mem_list,
-			next) {
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			ret = mem;
 			break;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 80378dd..d0c38b2 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -107,14 +107,17 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
 
-	mem = mm_iommu_find(vaddr, size >> PAGE_SHIFT);
+	mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(mem);
+	return mm_iommu_put(current->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -124,11 +127,14 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	struct mm_iommu_table_group_mem_t *mem = NULL;
 	unsigned long entries = size >> PAGE_SHIFT;
 
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
-	ret = mm_iommu_get(vaddr, entries, &mem);
+	ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
 
@@ -375,7 +381,7 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(tce, size);
+	mem = mm_iommu_lookup(current->mm, tce, size);
 	if (!mem)
 		return -EINVAL;
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container
  2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers Alexey Kardashevskiy
  2016-10-24  6:53 ` [PATCH kernel v4 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
@ 2016-10-24  6:53 ` Alexey Kardashevskiy
  2016-10-24  7:14   ` Nicholas Piggin
                     ` (2 more replies)
  2016-10-24  6:53 ` [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown Alexey Kardashevskiy
  2016-11-08  7:54 ` [PATCH kernel v4 0/4] powerpc/spapr/vfio: " Michael Ellerman
  4 siblings, 3 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-24  6:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Nicholas Piggin, Paul Mackerras

In some situations the userspace memory context may live longer than
the userspace process itself, so if we need to do proper memory context
cleanup, we had better have tce_container take a reference to mm_struct
and use it later, when the process is gone (@current or @current->mm is
NULL).

This references mm and stores the pointer in the container; this is done
when a container is first created, so checking for !current->mm in other
places becomes pointless.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container, to prevent userspace from making changes to the memory
context of other processes; in order to add this check,
VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
quite special anyway: it is the only ioctl() called when neither
the container nor container->mm is initialized.
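
The resulting shape of the ioctl entry point, as a simplified sketch
(the VFIO_CHECK_EXTENSION branch is condensed here; see the diff below
for the exact code):

	static long tce_iommu_ioctl(void *iommu_data,
			unsigned int cmd, unsigned long arg)
	{
		struct tce_container *container = iommu_data;

		/* the only ioctl() legal before container->mm matters */
		if (cmd == VFIO_CHECK_EXTENSION)
			return (arg == VFIO_SPAPR_TCE_IOMMU ||
					arg == VFIO_SPAPR_TCE_v2_IOMMU) ? 1 : 0;

		/* everything else must come from the owning process */
		if (container->mm != current->mm)
			return -ESRCH;

		switch (cmd) {
		/* ... per-command handling as in the diff below ... */
		}
		return -ENOTTY;
	}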

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* added a check for container->mm != current->mm in tce_iommu_ioctl()
for all ioctls and removed other redundant checks
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 131 ++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 65 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0c38b2..81ab93f 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -31,49 +31,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(long npages)
+static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 {
 	long ret = 0, locked, lock_limit;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	if (!npages)
 		return 0;
 
-	down_write(&current->mm->mmap_sem);
-	locked = current->mm->locked_vm + npages;
+	down_write(&mm->mmap_sem);
+	locked = mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
 		ret = -ENOMEM;
 	else
-		current->mm->locked_vm += npages;
+		mm->locked_vm += npages;
 
 	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
 			npages << PAGE_SHIFT,
-			current->mm->locked_vm << PAGE_SHIFT,
+			mm->locked_vm << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK),
 			ret ? " - exceeded" : "");
 
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 
 	return ret;
 }
 
-static void decrement_locked_vm(long npages)
+static void decrement_locked_vm(struct mm_struct *mm, long npages)
 {
-	if (!current || !current->mm || !npages)
-		return; /* process exited */
+	if (!npages)
+		return;
 
-	down_write(&current->mm->mmap_sem);
-	if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-		npages = current->mm->locked_vm;
-	current->mm->locked_vm -= npages;
+	down_write(&mm->mmap_sem);
+	if (WARN_ON_ONCE(npages > mm->locked_vm))
+		npages = mm->locked_vm;
+	mm->locked_vm -= npages;
 	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
 			npages << PAGE_SHIFT,
-			current->mm->locked_vm << PAGE_SHIFT,
+			mm->locked_vm << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK));
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 }
 
 /*
@@ -98,6 +95,7 @@ struct tce_container {
 	bool enabled;
 	bool v2;
 	unsigned long locked_pages;
+	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
 };
@@ -107,17 +105,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
 
-	mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
+	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(current->mm, mem);
+	return mm_iommu_put(container->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -127,14 +122,11 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	struct mm_iommu_table_group_mem_t *mem = NULL;
 	unsigned long entries = size >> PAGE_SHIFT;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
-	ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
+	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
 
@@ -143,7 +135,8 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return 0;
 }
 
-static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+		struct mm_struct *mm)
 {
 	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
 			tbl->it_size, PAGE_SIZE);
@@ -152,13 +145,13 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
 
 	BUG_ON(tbl->it_userspace);
 
-	ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
+	ret = try_increment_locked_vm(mm, cb >> PAGE_SHIFT);
 	if (ret)
 		return ret;
 
 	uas = vzalloc(cb);
 	if (!uas) {
-		decrement_locked_vm(cb >> PAGE_SHIFT);
+		decrement_locked_vm(mm, cb >> PAGE_SHIFT);
 		return -ENOMEM;
 	}
 	tbl->it_userspace = uas;
@@ -166,7 +159,8 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
 	return 0;
 }
 
-static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
+static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
+		struct mm_struct *mm)
 {
 	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
 			tbl->it_size, PAGE_SIZE);
@@ -176,7 +170,7 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
 
 	vfree(tbl->it_userspace);
 	tbl->it_userspace = NULL;
-	decrement_locked_vm(cb >> PAGE_SHIFT);
+	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
 }
 
 static bool tce_page_is_contained(struct page *page, unsigned page_shift)
@@ -236,9 +230,6 @@ static int tce_iommu_enable(struct tce_container *container)
 	struct iommu_table_group *table_group;
 	struct tce_iommu_group *tcegrp;
 
-	if (!current->mm)
-		return -ESRCH; /* process exited */
-
 	if (container->enabled)
 		return -EBUSY;
 
@@ -284,7 +275,7 @@ static int tce_iommu_enable(struct tce_container *container)
 		return -EPERM;
 
 	locked = table_group->tce32_size >> PAGE_SHIFT;
-	ret = try_increment_locked_vm(locked);
+	ret = try_increment_locked_vm(container->mm, locked);
 	if (ret)
 		return ret;
 
@@ -302,10 +293,7 @@ static void tce_iommu_disable(struct tce_container *container)
 
 	container->enabled = false;
 
-	if (!current->mm)
-		return;
-
-	decrement_locked_vm(container->locked_pages);
+	decrement_locked_vm(container->mm, container->locked_pages);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -326,13 +314,18 @@ static void *tce_iommu_open(unsigned long arg)
 
 	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
 
+	/* current->mm cannot be NULL in this context */
+	container->mm = current->mm;
+	atomic_inc(&container->mm->mm_count);
+
 	return container;
 }
 
 static int tce_iommu_clear(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
-static void tce_iommu_free_table(struct iommu_table *tbl);
+static void tce_iommu_free_table(struct tce_container *container,
+		struct iommu_table *tbl);
 
 static void tce_iommu_release(void *iommu_data)
 {
@@ -357,10 +350,11 @@ static void tce_iommu_release(void *iommu_data)
 			continue;
 
 		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		tce_iommu_free_table(tbl);
+		tce_iommu_free_table(container, tbl);
 	}
 
 	tce_iommu_disable(container);
+	mmdrop(container->mm);
 	mutex_destroy(&container->lock);
 
 	kfree(container);
@@ -375,13 +369,14 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 	put_page(page);
 }
 
-static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
+static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
+		unsigned long tce, unsigned long size,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(current->mm, tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, size);
 	if (!mem)
 		return -EINVAL;
 
@@ -394,18 +389,18 @@ static int tce_iommu_prereg_ua_to_hpa(unsigned long tce, unsigned long size,
 	return 0;
 }
 
-static void tce_iommu_unuse_page_v2(struct iommu_table *tbl,
-		unsigned long entry)
+static void tce_iommu_unuse_page_v2(struct tce_container *container,
+		struct iommu_table *tbl, unsigned long entry)
 {
 	struct mm_iommu_table_group_mem_t *mem = NULL;
 	int ret;
 	unsigned long hpa = 0;
 	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
 
-	if (!pua || !current || !current->mm)
+	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(*pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -435,7 +430,7 @@ static int tce_iommu_clear(struct tce_container *container,
 			continue;
 
 		if (container->v2) {
-			tce_iommu_unuse_page_v2(tbl, entry);
+			tce_iommu_unuse_page_v2(container, tbl, entry);
 			continue;
 		}
 
@@ -520,8 +515,8 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
 				entry + i);
 
-		ret = tce_iommu_prereg_ua_to_hpa(tce, IOMMU_PAGE_SIZE(tbl),
-				&hpa, &mem);
+		ret = tce_iommu_prereg_ua_to_hpa(container,
+				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
 		if (ret)
 			break;
 
@@ -542,7 +537,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp);
 		if (ret) {
 			/* dirtmp cannot be DMA_NONE here */
-			tce_iommu_unuse_page_v2(tbl, entry + i);
+			tce_iommu_unuse_page_v2(container, tbl, entry + i);
 			pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
 					__func__, entry << tbl->it_page_shift,
 					tce, ret);
@@ -550,7 +545,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 		}
 
 		if (dirtmp != DMA_NONE)
-			tce_iommu_unuse_page_v2(tbl, entry + i);
+			tce_iommu_unuse_page_v2(container, tbl, entry + i);
 
 		*pua = tce;
 
@@ -578,7 +573,7 @@ static long tce_iommu_create_table(struct tce_container *container,
 	if (!table_size)
 		return -EINVAL;
 
-	ret = try_increment_locked_vm(table_size >> PAGE_SHIFT);
+	ret = try_increment_locked_vm(container->mm, table_size >> PAGE_SHIFT);
 	if (ret)
 		return ret;
 
@@ -589,24 +584,25 @@ static long tce_iommu_create_table(struct tce_container *container,
 	WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
 
 	if (!ret && container->v2) {
-		ret = tce_iommu_userspace_view_alloc(*ptbl);
+		ret = tce_iommu_userspace_view_alloc(*ptbl, container->mm);
 		if (ret)
 			(*ptbl)->it_ops->free(*ptbl);
 	}
 
 	if (ret)
-		decrement_locked_vm(table_size >> PAGE_SHIFT);
+		decrement_locked_vm(container->mm, table_size >> PAGE_SHIFT);
 
 	return ret;
 }
 
-static void tce_iommu_free_table(struct iommu_table *tbl)
+static void tce_iommu_free_table(struct tce_container *container,
+		struct iommu_table *tbl)
 {
 	unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
 
-	tce_iommu_userspace_view_free(tbl);
+	tce_iommu_userspace_view_free(tbl, container->mm);
 	tbl->it_ops->free(tbl);
-	decrement_locked_vm(pages);
+	decrement_locked_vm(container->mm, pages);
 }
 
 static long tce_iommu_create_window(struct tce_container *container,
@@ -669,7 +665,7 @@ static long tce_iommu_create_window(struct tce_container *container,
 		table_group = iommu_group_get_iommudata(tcegrp->grp);
 		table_group->ops->unset_window(table_group, num);
 	}
-	tce_iommu_free_table(tbl);
+	tce_iommu_free_table(container, tbl);
 
 	return ret;
 }
@@ -707,7 +703,7 @@ static long tce_iommu_remove_window(struct tce_container *container,
 
 	/* Free table */
 	tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-	tce_iommu_free_table(tbl);
+	tce_iommu_free_table(container, tbl);
 	container->tables[num] = NULL;
 
 	return 0;
@@ -720,8 +716,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 	unsigned long minsz, ddwsz;
 	long ret;
 
-	switch (cmd) {
-	case VFIO_CHECK_EXTENSION:
+	if (cmd == VFIO_CHECK_EXTENSION) {
 		switch (arg) {
 		case VFIO_SPAPR_TCE_IOMMU:
 		case VFIO_SPAPR_TCE_v2_IOMMU:
@@ -733,7 +728,13 @@ static long tce_iommu_ioctl(void *iommu_data,
 		}
 
 		return (ret < 0) ? 0 : ret;
+	}
 
+	/* tce_iommu_open() initializes container->mm so it can't be NULL here */
+	if (container->mm != current->mm)
+		return -ESRCH;
+
+	switch (cmd) {
 	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
 		struct vfio_iommu_spapr_tce_info info;
 		struct tce_iommu_group *tcegrp;
@@ -1049,7 +1050,7 @@ static void tce_iommu_release_ownership(struct tce_container *container,
 			continue;
 
 		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		tce_iommu_userspace_view_free(tbl);
+		tce_iommu_userspace_view_free(tbl, container->mm);
 		if (tbl->it_map)
 			iommu_release_ownership(tbl);
 
@@ -1068,7 +1069,7 @@ static int tce_iommu_take_ownership(struct tce_container *container,
 		if (!tbl || !tbl->it_map)
 			continue;
 
-		rc = tce_iommu_userspace_view_alloc(tbl);
+		rc = tce_iommu_userspace_view_alloc(tbl, container->mm);
 		if (!rc)
 			rc = iommu_take_ownership(tbl);
 
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
@ 2016-10-24  6:53 ` Alexey Kardashevskiy
  2016-10-25  4:44   ` David Gibson
  2016-11-08  3:35   ` David Gibson
  2016-11-08  7:54 ` [PATCH kernel v4 0/4] powerpc/spapr/vfio: " Michael Ellerman
  4 siblings, 2 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-24  6:53 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alex Williamson, David Gibson,
	Nicholas Piggin, Paul Mackerras

At the moment the userspace tool is expected to request pinning of
the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction, which happens on the very last mmdrop().

This approach has a problem: the MM of the userspace process may
live longer than the userspace process itself, as kernel threads
use the MM of the userspace process that was running on the CPU
where the kernel thread was scheduled. If this happens, the MM remains
referenced until that exact kernel thread wakes up again
and releases the very last reference to the MM; on an idle system this
can take hours.

This moves preregistered regions tracking from MM to VFIO; instead of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so that each container releases the regions it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However, it should not
have any practical effect, as the only userspace tool available now
registers a memory region only once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
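
From userspace, the visible change is limited to registering the same
region twice within one container. A sketch of that path via the
existing VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl (container_fd, buf and
len are illustrative names; vaddr and size must be page aligned):

	struct vfio_iommu_spapr_register_memory reg = {
		.argsz = sizeof(reg),
		.flags = 0,
		.vaddr = (__u64)(uintptr_t)buf,
		.size  = len,
	};

	/* the first registration pins the pages and succeeds */
	ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);

	/* registering the same region again in the same container now
	 * fails with errno == EBUSY instead of bumping a use count */
	ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);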

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
Changes:
v4:
* changed tce_iommu_register_pages() to call mm_iommu_find() first and
avoid calling mm_iommu_put() if memory is preregistered already

v3:
* moved tce_iommu_prereg_free() call out of list_for_each_entry()

v2:
* updated commit log
---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c    | 11 -------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 58 +++++++++++++++++++++++++++++++++-
 3 files changed, 57 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 4c6db09..104bad0 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
 	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-	struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-			next) {
-		list_del_rcu(&mem->next);
-		mm_iommu_do_free(mem);
-	}
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 81ab93f..001a488 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -86,6 +86,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered regions it has
+ * referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+	struct list_head next;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -98,12 +107,27 @@ struct tce_container {
 	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
+	struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+		struct tce_iommu_prereg *tcemem)
+{
+	long ret;
+
+	list_del(&tcemem->next);
+	ret = mm_iommu_put(container->mm, tcemem->mem);
+	kfree(tcemem);
+
+	return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
 		__u64 vaddr, __u64 size)
 {
 	struct mm_iommu_table_group_mem_t *mem;
+	struct tce_iommu_prereg *tcemem;
+	bool found = false;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
@@ -112,7 +136,17 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(container->mm, mem);
+	list_for_each_entry(tcemem, &container->prereg_list, next) {
+		if (tcemem->mem == mem) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return -ENOENT;
+
+	return tce_iommu_prereg_free(container, tcemem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -120,16 +154,29 @@ static long tce_iommu_register_pages(struct tce_container *container,
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem = NULL;
+	struct tce_iommu_prereg *tcemem;
 	unsigned long entries = size >> PAGE_SHIFT;
 
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
+	mem = mm_iommu_find(container->mm, vaddr, entries);
+	if (mem) {
+		list_for_each_entry(tcemem, &container->prereg_list, next) {
+			if (tcemem->mem == mem)
+				return -EBUSY;
+		}
+	}
+
 	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
 
+	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
+	tcemem->mem = mem;
+	list_add(&tcemem->next, &container->prereg_list);
+
 	container->enabled = true;
 
 	return 0;
@@ -311,6 +358,7 @@ static void *tce_iommu_open(unsigned long arg)
 
 	mutex_init(&container->lock);
 	INIT_LIST_HEAD_RCU(&container->group_list);
+	INIT_LIST_HEAD_RCU(&container->prereg_list);
 
 	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
 
@@ -353,6 +401,14 @@ static void tce_iommu_release(void *iommu_data)
 		tce_iommu_free_table(container, tbl);
 	}
 
+	while (!list_empty(&container->prereg_list)) {
+		struct tce_iommu_prereg *tcemem;
+
+		tcemem = list_first_entry(&container->prereg_list,
+				struct tce_iommu_prereg, next);
+		tce_iommu_prereg_free(container, tcemem);
+	}
+
 	tce_iommu_disable(container);
 	mmdrop(container->mm);
 	mutex_destroy(&container->lock);
-- 
2.5.0.rc3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container
  2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
@ 2016-10-24  7:14   ` Nicholas Piggin
  2016-11-08  3:33   ` David Gibson
  2016-11-08 22:25   ` Alex Williamson
  2 siblings, 0 replies; 21+ messages in thread
From: Nicholas Piggin @ 2016-10-24  7:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, David Gibson, Paul Mackerras

On Mon, 24 Oct 2016 17:53:09 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself, so if we need to do proper memory context
> cleanup, we had better have tce_container take a reference to mm_struct
> and use it later, when the process is gone (@current or @current->mm is
> NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is first created, so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container, to prevent userspace from making changes to the memory
> context of other processes; in order to add this check,
> VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> quite special anyway: it is the only ioctl() called when neither
> the container nor container->mm is initialized.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added a check for container->mm != current->mm in tce_iommu_ioctl()
> for all ioctls and removed other redundant checks

[...]

> @@ -326,13 +314,18 @@ static void *tce_iommu_open(unsigned long arg)
>  
>  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> +	/* current->mm cannot be NULL in this context */
> +	container->mm = current->mm;
> +	atomic_inc(&container->mm->mm_count);

[...]

> @@ -733,7 +728,13 @@ static long tce_iommu_ioctl(void *iommu_data,
>  		}
>  
>  		return (ret < 0) ? 0 : ret;
> +	}
>  
> +	/* tce_iommu_open() initializes container->mm so it can't be NULL here */
> +	if (container->mm != current->mm)
> +		return -ESRCH;
> +
> +	switch (cmd) {
>  	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
>  		struct vfio_iommu_spapr_tce_info info;
>  		struct tce_iommu_group *tcegrp;

I think doing the mm checks like this is a great improvement.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-24  6:53 ` [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown Alexey Kardashevskiy
@ 2016-10-25  4:44   ` David Gibson
  2016-10-25  4:55     ` Alexey Kardashevskiy
  2016-11-08  3:35   ` David Gibson
  1 sibling, 1 reply; 21+ messages in thread
From: David Gibson @ 2016-10-25  4:44 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras

On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> At the moment the userspace tool is expected to request pinning of
> the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
> When the userspace process finishes, all the pinned pages need to
> be put; this is done as a part of the userspace memory context (MM)
> destruction, which happens on the very last mmdrop().
> 
> This approach has a problem: the MM of the userspace process may
> live longer than the userspace process itself, as kernel threads
> use the MM of the userspace process that was running on the CPU
> where the kernel thread was scheduled. If this happens, the MM remains
> referenced until that exact kernel thread wakes up again
> and releases the very last reference to the MM; on an idle system this
> can take hours.
> 
> This moves preregistered regions tracking from MM to VFIO; instead of
> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
> added so that each container releases the regions it has pre-registered.
> 
> This changes the userspace interface to return EBUSY if a memory
> region is already registered in a container. However, it should not
> have any practical effect, as the only userspace tool available now
> registers a memory region only once per container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

On the grounds that this leaves things in a better state than before:

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

On the other hand the implementation is kind of clunky, with the way
it keeps the mm-level and vfio-level lists of regions in parallel.
With this change, does the mm-level list actually serve any purpose at
all, or could it all be moved into the vfio-level list?

> [snip]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-25  4:44   ` David Gibson
@ 2016-10-25  4:55     ` Alexey Kardashevskiy
  2016-10-31  3:13       ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-25  4:55 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras


On 25/10/16 15:44, David Gibson wrote:
> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
>> At the moment the userspace tool is expected to request pinning of
>> the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
>> When the userspace process finishes, all the pinned pages need to
>> be put; this is done as a part of the userspace memory context (MM)
>> destruction, which happens on the very last mmdrop().
>>
>> This approach has a problem: the MM of the userspace process may
>> live longer than the userspace process itself, as kernel threads
>> use the MM of the userspace process that was running on the CPU
>> where the kernel thread was scheduled. If this happens, the MM remains
>> referenced until that exact kernel thread wakes up again
>> and releases the very last reference to the MM; on an idle system this
>> can take hours.
>>
>> This moves preregistered regions tracking from MM to VFIO; instead of
>> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
>> added so that each container releases the regions it has pre-registered.
>>
>> This changes the userspace interface to return EBUSY if a memory
>> region is already registered in a container. However, it should not
>> have any practical effect, as the only userspace tool available now
>> registers a memory region only once per container anyway.
>>
>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
>> under container->lock, this does not need additional locking.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> 
> On the grounds that this leaves things in a better state than before:
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> On the other hand the implementation is kind of clunky, with the way
> it keeps the mm-level and vfio-level lists of regions in parallel.
> With this change, does the mm-level list actually serve any purpose at
> all, or could it all be moved into the vfio-level list?


The mm-level list avoids having gup() called for each container (a minor
thing, I suppose) and it also tracks the number of active mappings, which
will become useful when we add in-kernel real-mode TCE acceleration, as
vfio-level code cannot run in realmode.



> 
>> [snip]
> 


-- 
Alexey


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-25  4:55     ` Alexey Kardashevskiy
@ 2016-10-31  3:13       ` David Gibson
  2016-10-31  4:13         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2016-10-31  3:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras

On Tue, Oct 25, 2016 at 03:55:56PM +1100, Alexey Kardashevskiy wrote:
> On 25/10/16 15:44, David Gibson wrote:
> > On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> >> At the moment the userspace tool is expected to request pinning of
> >> the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
> >> When the userspace process finishes, all the pinned pages need to
> >> be put; this is done as a part of the userspace memory context (MM)
> >> destruction, which happens on the very last mmdrop().
> >>
> >> This approach has a problem: an MM of the userspace process
> >> may live longer than the userspace process itself, as kernel threads
> >> borrow the MM of the userspace process that was running on the CPU
> >> where the kernel thread was scheduled. If this happens, the MM remains
> >> referenced until that exact kernel thread wakes up again and releases
> >> the very last reference to the MM; on an idle system this can take
> >> hours.
> >>
> >> This moves preregistered-region tracking from MM to VFIO; instead of
> >> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
> >> added so each container releases regions which it has pre-registered.
> >>
> >> This changes the userspace interface to return EBUSY if a memory
> >> region is already registered in a container. However, it should not
> >> have any practical effect, as the only userspace tool available now
> >> registers a memory region only once per container anyway.
> >>
> >> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> >> under container->lock, this does not need additional locking.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> > 
> > On the grounds that this leaves things in a better state than before:
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > On the other hand the implementation is kind of clunky, with the way
> > it keeps the mm-level and vfio-level lists of regions in parallel.
> > With this change, does the mm-level list actually serve any purpose at
> > all, or could it all be moved into the vfio-level list?
> 
> 
> The mm-level list avoids gup() being called for each container (a minor
> thing, I suppose), and it also tracks the number of active mappings, which
> will become useful when we add in-kernel real-mode TCE acceleration, as
> vfio-level code cannot run in real mode.

Hm, ok.  So, if two different containers pre-register the same region
of memory, IIUC in the proposed code, the region will get one entry in
the mm level list, and that entry will be referenced in the lists for
both containers.  Yes?

What happens if two different containers try to pre-register
different, but overlapping, mm regions?
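
For concreteness, the two-level tracking under discussion can be reduced
to a toy model (a sketch with invented names, page-unit addresses and no
locking; this is not the kernel code):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* mm-level entry: one per registered range, shared by containers */
    struct mem_region {
            unsigned long ua;       /* start, in page units here */
            unsigned long entries;  /* length in pages */
            int refs;               /* containers referencing the entry */
            struct mem_region *next;
    };

    /* container-level entry: what tce_container::prereg_list holds */
    struct prereg {
            struct mem_region *mem;
            struct prereg *next;
    };

    static struct mem_region *mm_list;  /* stands in for the per-mm list */

    static int mem_get(unsigned long ua, unsigned long entries,
                       struct mem_region **pmem)
    {
            struct mem_region *m;

            for (m = mm_list; m; m = m->next) {
                    if (m->ua == ua && m->entries == entries) {
                            m->refs++;      /* exact match: share the entry */
                            *pmem = m;
                            return 0;
                    }
                    if (ua < m->ua + m->entries && m->ua < ua + entries)
                            return -EINVAL; /* partial overlap: refused */
            }

            m = calloc(1, sizeof(*m));
            if (!m)
                    return -ENOMEM;
            m->ua = ua;
            m->entries = entries;
            m->refs = 1;    /* pinning (gup) would happen here, only once */
            m->next = mm_list;
            mm_list = m;
            *pmem = m;
            return 0;
    }

    static int container_register(struct prereg **plist,
                                  unsigned long ua, unsigned long entries)
    {
            struct prereg *p;
            struct mem_region *mem;
            int ret;

            for (p = *plist; p; p = p->next)
                    if (p->mem->ua == ua && p->mem->entries == entries)
                            return -EBUSY;  /* this container has it already */

            ret = mem_get(ua, entries, &mem);
            if (ret)
                    return ret;

            p = calloc(1, sizeof(*p));
            if (!p)
                    return -ENOMEM;
            p->mem = mem;
            p->next = *plist;
            *plist = p;
            return 0;
    }

    int main(void)
    {
            struct prereg *a = NULL, *b = NULL;

            printf("A [0,16): %d\n", container_register(&a, 0, 16)); /* 0 */
            printf("B [0,16): %d\n", container_register(&b, 0, 16)); /* 0, refs=2 */
            printf("B [8,24): %d\n", container_register(&b, 8, 16)); /* -EINVAL */
            return 0;
    }

Running it, the second exact registration shares the one mm-level entry
(refs becomes 2), while the overlapping one is refused.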

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-31  3:13       ` David Gibson
@ 2016-10-31  4:13         ` Alexey Kardashevskiy
  2016-10-31  4:23           ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-31  4:13 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras


On 31/10/16 14:13, David Gibson wrote:
> On Tue, Oct 25, 2016 at 03:55:56PM +1100, Alexey Kardashevskiy wrote:
>> On 25/10/16 15:44, David Gibson wrote:
>>> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
>>>> At the moment the userspace tool is expected to request pinning of
>>>> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
>>>> When the userspace process finishes, all the pinned pages need to
>>>> be put; this is done as a part of the userspace memory context (MM)
>>>> destruction which happens on the very last mmdrop().
>>>>
>>>> This approach has a problem: an MM of the userspace process
>>>> may live longer than the userspace process itself, as kernel threads
>>>> borrow the MM of the userspace process that was running on the CPU
>>>> where the kernel thread was scheduled. If this happens, the MM remains
>>>> referenced until that exact kernel thread wakes up again and releases
>>>> the very last reference to the MM; on an idle system this can take
>>>> hours.
>>>>
>>>> This moves preregistered-region tracking from MM to VFIO; instead of
>>>> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
>>>> added so each container releases regions which it has pre-registered.
>>>>
>>>> This changes the userspace interface to return EBUSY if a memory
>>>> region is already registered in a container. However, it should not
>>>> have any practical effect, as the only userspace tool available now
>>>> registers a memory region only once per container anyway.
>>>>
>>>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
>>>> under container->lock, this does not need additional locking.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>>>
>>> On the grounds that this leaves things in a better state than before:
>>>
>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>
>>> On the other hand the implementation is kind of clunky, with the way
>>> it keeps the mm-level and vfio-level lists of regions in parallel.
>>> With this change, does the mm-level list actually serve any purpose at
>>> all, or could it all be moved into the vfio-level list?
>>
>>
>> The mm-level list avoids gup() being called for each container (a minor
>> thing, I suppose), and it also tracks the number of active mappings, which
>> will become useful when we add in-kernel real-mode TCE acceleration, as
>> vfio-level code cannot run in real mode.
> 
> Hm, ok.  So, if two different containers pre-register the same region
> of memory, IIUC in the proposed code, the region will get one entry in
> the mm level list, and that entry will be referenced in the lists for
> both containers.  Yes?

Yes.


> What happens if two different containers try to pre-register
> different, but overlapping, mm regions?

The second container will fail to preregister memory - mm_iommu_get() will
return -EINVAL.
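
In userspace terms (a hypothetical sketch; the two container fds are
assumed to be already attached to groups and switched to
VFIO_SPAPR_TCE_v2_IOMMU):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int register_mem(int container_fd, void *va, unsigned long long size)
    {
            struct vfio_iommu_spapr_register_memory reg = {
                    .argsz = sizeof(reg),
                    .flags = 0,
                    .vaddr = (unsigned long)va,
                    .size  = size,
            };

            if (ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg)) {
                    fprintf(stderr, "register %p +0x%llx: %s\n",
                            va, size, strerror(errno));
                    return -errno;
            }
            return 0;
    }

    /*
     * With fd_a and fd_b being two containers of the same process and
     * buf a 64MB buffer:
     *
     *   register_mem(fd_a, buf, 64 << 20);              succeeds
     *   register_mem(fd_b, buf, 64 << 20);              succeeds, shares
     *                                                   the mm-level entry
     *   register_mem(fd_b, buf + (32 << 20), 64 << 20); fails with EINVAL,
     *                                                   overlaps fd_a's range
     */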


I am wondering what happens to the series now.

Alex, could you please have a look and comment? Thanks.



-- 
Alexey



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-31  4:13         ` Alexey Kardashevskiy
@ 2016-10-31  4:23           ` David Gibson
  2016-11-02  2:44             ` Alexey Kardashevskiy
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2016-10-31  4:23 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras

On Mon, Oct 31, 2016 at 03:13:21PM +1100, Alexey Kardashevskiy wrote:
> On 31/10/16 14:13, David Gibson wrote:
> > On Tue, Oct 25, 2016 at 03:55:56PM +1100, Alexey Kardashevskiy wrote:
> >> On 25/10/16 15:44, David Gibson wrote:
> >>> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> >>>> At the moment the userspace tool is expected to request pinning of
> >>>> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
> >>>> When the userspace process finishes, all the pinned pages need to
> >>>> be put; this is done as a part of the userspace memory context (MM)
> >>>> destruction which happens on the very last mmdrop().
> >>>>
> >>>> This approach has a problem: an MM of the userspace process
> >>>> may live longer than the userspace process itself, as kernel threads
> >>>> borrow the MM of the userspace process that was running on the CPU
> >>>> where the kernel thread was scheduled. If this happens, the MM remains
> >>>> referenced until that exact kernel thread wakes up again and releases
> >>>> the very last reference to the MM; on an idle system this can take
> >>>> hours.
> >>>>
> >>>> This moves preregistered-region tracking from MM to VFIO; instead of
> >>>> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
> >>>> added so each container releases regions which it has pre-registered.
> >>>>
> >>>> This changes the userspace interface to return EBUSY if a memory
> >>>> region is already registered in a container. However, it should not
> >>>> have any practical effect, as the only userspace tool available now
> >>>> registers a memory region only once per container anyway.
> >>>>
> >>>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> >>>> under container->lock, this does not need additional locking.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> >>>
> >>> On the grounds that this leaves things in a better state than before:
> >>>
> >>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>>
> >>> On the other hand the implementation is kind of clunky, with the way
> >>> it keeps the mm-level and vfio-level lists of regions in parallel.
> >>> With this change, does the mm-level list actually serve any purpose at
> >>> all, or could it all be moved into the vfio-level list?
> >>
> >>
> >> The mm-level list avoids gup() being called for each container (a minor
> >> thing, I suppose), and it also tracks the number of active mappings, which
> >> will become useful when we add in-kernel real-mode TCE acceleration, as
> >> vfio-level code cannot run in real mode.
> > 
> > Hm, ok.  So, if two different containers pre-register the same region
> > of memory, IIUC in the proposed code, the region will get one entry in
> > the mm level list, and that entry will be referenced in the lists for
> > both containers.  Yes?
> 
> Yes.
> 
> 
> > What happens if two different containers try to pre-register
> > different, but overlapping, mm regions?
> 
> The second container will fail to preregister memory - mm_iommu_get() will
> return -EINVAL.

Um.. yeah.. that's not really ok.  Prohibiting overlapping
registrations on the same container is reasonable enough.  Having a
container not be able to register memory because some completely
different container has registered something overlapping is getting
very ugly.

> I am wondering what happens to the series now.
> 
> Alex, could you please have a look and comment? Thanks.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-31  4:23           ` David Gibson
@ 2016-11-02  2:44             ` Alexey Kardashevskiy
  2016-11-03  1:02               ` Paul Mackerras
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-02  2:44 UTC (permalink / raw)
  To: David Gibson
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras


On 31/10/16 15:23, David Gibson wrote:
> On Mon, Oct 31, 2016 at 03:13:21PM +1100, Alexey Kardashevskiy wrote:
>> On 31/10/16 14:13, David Gibson wrote:
>>> On Tue, Oct 25, 2016 at 03:55:56PM +1100, Alexey Kardashevskiy wrote:
>>>> On 25/10/16 15:44, David Gibson wrote:
>>>>> On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
>>>>>> At the moment the userspace tool is expected to request pinning of
>>>>>> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
>>>>>> When the userspace process finishes, all the pinned pages need to
>>>>>> be put; this is done as a part of the userspace memory context (MM)
>>>>>> destruction which happens on the very last mmdrop().
>>>>>>
>>>>>> This approach has a problem: an MM of the userspace process
>>>>>> may live longer than the userspace process itself, as kernel threads
>>>>>> borrow the MM of the userspace process that was running on the CPU
>>>>>> where the kernel thread was scheduled. If this happens, the MM remains
>>>>>> referenced until that exact kernel thread wakes up again and releases
>>>>>> the very last reference to the MM; on an idle system this can take
>>>>>> hours.
>>>>>>
>>>>>> This moves preregistered-region tracking from MM to VFIO; instead of
>>>>>> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
>>>>>> added so each container releases regions which it has pre-registered.
>>>>>>
>>>>>> This changes the userspace interface to return EBUSY if a memory
>>>>>> region is already registered in a container. However, it should not
>>>>>> have any practical effect, as the only userspace tool available now
>>>>>> registers a memory region only once per container anyway.
>>>>>>
>>>>>> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
>>>>>> under container->lock, this does not need additional locking.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>>>>>
>>>>> On the grounds that this leaves things in a better state than before:
>>>>>
>>>>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>>>>
>>>>> On the other hand the implementation is kind of clunky, with the way
>>>>> it keeps the mm-level and vfio-level lists of regions in parallel.
>>>>> With this change, does the mm-level list actually serve any purpose at
>>>>> all, or could it all be moved into the vfio-level list?
>>>>
>>>>
>>>> The mm-level list avoids gup() being called for each container (a minor
>>>> thing, I suppose), and it also tracks the number of active mappings, which
>>>> will become useful when we add in-kernel real-mode TCE acceleration, as
>>>> vfio-level code cannot run in real mode.
>>>
>>> Hm, ok.  So, if two different containers pre-register the same region
>>> of memory, IIUC in the proposed code, the region will get one entry in
>>> the mm level list, and that entry will be referenced in the lists for
>>> both containers.  Yes?
>>
>> Yes.
>>
>>
>>> What happens if two different containers try to pre-register
>>> different, but overlapping, mm regions?
>>
>> The second container will fail to preregister memory - mm_iommu_get() will
>> return -EINVAL.
> 
> Um.. yeah.. that's not really ok.  Prohibiting overlapping
> registrations on the same container is reasonable enough.  Having a
> container not be able to register memory because some completely
> different container has registered something overlapping is getting
> very ugly.

I am lost here. Does this mean the patches cannot go upstream?

Also, how would I implement overlapping regions if we are not teaching KVM
about VFIO containers? The mm list has a counter of how many times each
memory region was mapped via TCE (this is what prevents unregistration),
and if we allowed overlapping regions, the "mapped" counter of which one
would I update in real mode (where I only have a user address and a LIOBN)?
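
A simplified model of that "mapped" counter (an assumption-based sketch,
not the kernel code): the counter has to live on the single mm-level
entry, and it is atomic so that real-mode code, which cannot take the
vfio-level locks, can still update it after looking the entry up by user
address.

    #include <errno.h>
    #include <stdatomic.h>

    struct mem_region {
            long refs;           /* preregistrations (container references) */
            atomic_long mapped;  /* live TCEs pointing into the region */
    };

    /* called for every TCE created against the region,
     * possibly from real-mode TCE handling code */
    static void region_mapped_inc(struct mem_region *m)
    {
            atomic_fetch_add(&m->mapped, 1);
    }

    /* called when such a TCE is cleared */
    static void region_mapped_dec(struct mem_region *m)
    {
            atomic_fetch_sub(&m->mapped, 1);
    }

    /* unregistration is refused while TCEs still reference the pages */
    static int region_put(struct mem_region *m)
    {
            if (atomic_load(&m->mapped) > 0)
                    return -EBUSY;
            if (--m->refs == 0) {
                    /* put the pages and free the entry here */
            }
            return 0;
    }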



-- 
Alexey



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-11-02  2:44             ` Alexey Kardashevskiy
@ 2016-11-03  1:02               ` Paul Mackerras
  2016-11-08  3:33                 ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Mackerras @ 2016-11-03  1:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: David Gibson, linuxppc-dev, Alex Williamson, Nicholas Piggin

On Wed, Nov 02, 2016 at 01:44:03PM +1100, Alexey Kardashevskiy wrote:
> On 31/10/16 15:23, David Gibson wrote:
[...]
> > 
> > Um.. yeah.. that's not really ok.  Prohibiting overlapping
> > registrations on the same container is reasonable enough.  Having a
> > container not be able to register memory because some completely
> > different container has registered something overlapping is getting
> > very ugly.
> 
> I am lost here. Does this mean the patches cannot go upstream?
> 
> Also how would I implement overlapping if we are not teaching KVM about
> VFIO containers? The mm list has a counter of how many times each memory
> region was mapped via TCE (and this prevents unregistration), and if we
> want overlapping regions - a "mapped" counter of which one would I update
> in real mode (where I only have a user address and a LIOBN)?

The patches fix a real bug, where we run out of memory to run VMs.

The patches don't change the interface, and don't introduce the
constraint that is being discussed here (that the regions being
registered may not overlap unless they are identical to a previously
registered region).  That constraint is already present in the
upstream code.

They do change the behaviour when you use a container fd from a
different process from the one which opened the fd, but that is not
something that worked in any meaningful way before anyway.
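
A hypothetical illustration of that behavioural change (container_fd,
buf and size are assumed to have been set up by the parent process;
this is for illustration only):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/wait.h>
    #include <linux/vfio.h>

    static void register_from_child(int container_fd, void *buf,
                                    unsigned long long size)
    {
            if (fork() == 0) {
                    struct vfio_iommu_spapr_register_memory reg = {
                            .argsz = sizeof(reg),
                            .vaddr = (unsigned long)buf,
                            .size  = size,
                    };

                    /* the child's mm is not the mm cached at open time,
                     * so with this series the ioctl fails with ESRCH */
                    if (ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY,
                              &reg))
                            perror("register from child");
                    _exit(0);
            }
            wait(NULL);
    }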

So David, do you still see any reason why the patches should not be
accepted?

Regards,
Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-11-03  1:02               ` Paul Mackerras
@ 2016-11-08  3:33                 ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2016-11-08  3:33 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexey Kardashevskiy, linuxppc-dev, Alex Williamson, Nicholas Piggin

On Thu, Nov 03, 2016 at 12:02:28PM +1100, Paul Mackerras wrote:
> On Wed, Nov 02, 2016 at 01:44:03PM +1100, Alexey Kardashevskiy wrote:
> > On 31/10/16 15:23, David Gibson wrote:
> [...]
> > > 
> > > Um.. yeah.. that's not really ok.  Prohibiting overlapping
> > > registrations on the same container is reasonable enough.  Having a
> > > container not be able to register memory because some completely
> > > different container has registered something overlapping is getting
> > > very ugly.
> > 
> > I am lost here. Does this mean the patches cannot go upstream?
> > 
> > Also how would I implement overlapping if we are not teaching KVM about
> > VFIO containers? The mm list has a counter of how many times each memory
> > region was mapped via TCE (and this prevents unregistration), and if we
> > want overlapping regions - a "mapped" counter of which one would I update
> > in real mode (where I only have a user address and a LIOBN)?
> 
> The patches fix a real bug, where we run out of memory to run VMs.
> 
> The patches don't change the interface, and don't introduce the
> constraint that is being discussed here (that the regions being
> registered may not overlap unless they are identical to a previously
> registered region).  That constraint is already present in the
> upstream code.

Ah, good point.  I hadn't thought that through and realized that
limitation was already there.

> They do change the behaviour when you use a container fd from a
> different process from the one which opened the fd, but that is not
> something that worked in any meaningful way before anyway.
> 
> So David, do you still see any reason why the patches should not be
> accepted?

Given the (still hideously ugly) limitation is not new, then no, not
any more.  I'll send an R-b.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container
  2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
  2016-10-24  7:14   ` Nicholas Piggin
@ 2016-11-08  3:33   ` David Gibson
  2016-11-08 22:25   ` Alex Williamson
  2 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2016-11-08  3:33 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras

On Mon, Oct 24, 2016 at 05:53:09PM +1100, Alexey Kardashevskiy wrote:
> In some situations the userspace memory context may live longer than
> the userspace process itself, so if we need to do proper memory context
> cleanup, we had better have tce_container take a reference to mm_struct
> and use it later when the process is gone (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from making changes to a memory
> context of other processes; in order to add this check,
> VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> quite special anyway - it is the only ioctl() called when neither
> container nor container->mm is initialized.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
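
The shape of that change, condensed into a short sketch (elided parts are
marked with comments; the point is that the container takes an mm_count
reference, released with mmdrop(), which keeps the mm_struct and its
bookkeeping alive without holding the whole address space the way an
mm_users reference would):

    static void *tce_iommu_open(unsigned long arg)
    {
            struct tce_container *container;

            /* ... allocation and checks ... */

            /* current->mm cannot be NULL in this context */
            container->mm = current->mm;
            atomic_inc(&container->mm->mm_count);   /* pin the mm_struct */

            return container;
    }

    static long tce_iommu_ioctl(void *iommu_data,
                    unsigned int cmd, unsigned long arg)
    {
            struct tce_container *container = iommu_data;

            if (cmd == VFIO_CHECK_EXTENSION) {
                    /* the only ioctl valid regardless of the caller's mm;
                     * handled and returned from right here */
            }

            if (container->mm != current->mm)
                    return -ESRCH;  /* fd used from a foreign process */

            switch (cmd) {
            /* ... the remaining ioctls ... */
            }

            /* ... */
    }

    static void tce_iommu_release(void *iommu_data)
    {
            struct tce_container *container = iommu_data;

            /* ... free tables, put preregistered regions ... */

            mmdrop(container->mm);  /* may free the mm_struct */
            /* ... */
    }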


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  2016-10-24  6:53 ` [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown Alexey Kardashevskiy
  2016-10-25  4:44   ` David Gibson
@ 2016-11-08  3:35   ` David Gibson
  1 sibling, 0 replies; 21+ messages in thread
From: David Gibson @ 2016-11-08  3:35 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, Alex Williamson, Nicholas Piggin, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 7713 bytes --]

On Mon, Oct 24, 2016 at 05:53:10PM +1100, Alexey Kardashevskiy wrote:
> At the moment the userspace tool is expected to request pinning of
> the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
> When the userspace process finishes, all the pinned pages need to
> be put; this is done as part of the userspace memory context (MM)
> destruction, which happens on the very last mmdrop().
> 
> This approach has a problem: the MM of the userspace process may
> live longer than the userspace process itself, as kernel threads
> keep using the MM of whatever userspace process last ran on the CPU
> they were scheduled on. If this happens, the MM remains referenced
> until that exact kernel thread wakes up again and releases the very
> last reference to the MM; on an idle system this can take hours.
> 
> This moves preregistered region tracking from the MM to VFIO; instead
> of using mm_iommu_table_group_mem_t::used, a tce_container::prereg_list
> is added so that each container releases the regions it has preregistered.
> 
> This changes the userspace interface to return EBUSY if a memory
> region is already registered in a container. However, it should not
> have any practical effect as the only userspace tool available now
> registers a memory region only once per container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Sorry for the delay.  I got hung up on the fact that the regions
reserved by one container can cause strange behaviour on unrelated
containers (one container can't reserve a region overlapping with a
region reserved by another container).

That limitation is awful, and I wish I'd spotted it when we first
implemented this.  But it's not new with this patch series, so it's
not a reason to hold up this series, which does fix a real
bug.
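
(For the record, the cross-container conflict comes from the overlap
test in mm_iommu_get(), which walks a per-mm list rather than anything
per-container.  Roughly, paraphrasing from memory rather than quoting
the exact code:

	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
			next) {
		/* any intersection with an existing region fails */
		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
		    (ua < (mem->ua + (mem->entries << PAGE_SHIFT)))) {
			ret = -EINVAL;
			goto unlock_exit;
		}
	}

Since that list hangs off the mm, any two containers backed by the
same mm see each other's registrations.)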



> ---
> Changes:
> v4:
> * changed tce_iommu_register_pages() to call mm_iommu_find() first and
> avoid calling mm_iommu_put() if memory is preregistered already
> 
> v3:
> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
> 
> v2:
> * updated commit log
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>  arch/powerpc/mm/mmu_context_iommu.c    | 11 -------
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 58 +++++++++++++++++++++++++++++++++-
>  3 files changed, 57 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index ad82735..1a07969 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
>  
>  void destroy_context(struct mm_struct *mm)
>  {
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
> -	mm_iommu_cleanup(mm);
> -#endif
> -
>  #ifdef CONFIG_PPC_ICSWX
>  	drop_cop(mm->context.acop, mm);
>  	kfree(mm->context.cop_lockp);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index 4c6db09..104bad0 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
>  {
>  	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
> -
> -void mm_iommu_cleanup(struct mm_struct *mm)
> -{
> -	struct mm_iommu_table_group_mem_t *mem, *tmp;
> -
> -	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> -			next) {
> -		list_del_rcu(&mem->next);
> -		mm_iommu_do_free(mem);
> -	}
> -}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 81ab93f..001a488 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -86,6 +86,15 @@ struct tce_iommu_group {
>  };
>  
>  /*
> + * A container needs to remember which preregistered region it has
> + * referenced to do proper cleanup at the userspace process exit.
> + */
> +struct tce_iommu_prereg {
> +	struct list_head next;
> +	struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +/*
>   * The container descriptor supports only a single group per container.
>   * Required by the API as the container is not supplied with the IOMMU group
>   * at the moment of initialization.
> @@ -98,12 +107,27 @@ struct tce_container {
>  	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
> +	struct list_head prereg_list;
>  };
>  
> +static long tce_iommu_prereg_free(struct tce_container *container,
> +		struct tce_iommu_prereg *tcemem)
> +{
> +	long ret;
> +
> +	list_del(&tcemem->next);
> +	ret = mm_iommu_put(container->mm, tcemem->mem);
> +	kfree(tcemem);
> +
> +	return ret;
> +}
> +
>  static long tce_iommu_unregister_pages(struct tce_container *container,
>  		__u64 vaddr, __u64 size)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
> +	struct tce_iommu_prereg *tcemem;
> +	bool found = false;
>  
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
> @@ -112,7 +136,17 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  	if (!mem)
>  		return -ENOENT;
>  
> -	return mm_iommu_put(container->mm, mem);
> +	list_for_each_entry(tcemem, &container->prereg_list, next) {
> +		if (tcemem->mem == mem) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (!found)
> +		return -ENOENT;
> +
> +	return tce_iommu_prereg_free(container, tcemem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -120,16 +154,29 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  {
>  	long ret = 0;
>  	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	struct tce_iommu_prereg *tcemem;
>  	unsigned long entries = size >> PAGE_SHIFT;
>  
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
>  			((vaddr + size) < vaddr))
>  		return -EINVAL;
>  
> +	mem = mm_iommu_find(container->mm, vaddr, entries);
> +	if (mem) {
> +		list_for_each_entry(tcemem, &container->prereg_list, next) {
> +			if (tcemem->mem == mem)
> +				return -EBUSY;
> +		}
> +	}
> +
>  	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>  	if (ret)
>  		return ret;
>  
> +	tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL);
> +	if (!tcemem) {
> +		/* drop the reference taken by mm_iommu_get() above */
> +		mm_iommu_put(container->mm, mem);
> +		return -ENOMEM;
> +	}
> +	tcemem->mem = mem;
> +	list_add(&tcemem->next, &container->prereg_list);
> +
>  	container->enabled = true;
>  
>  	return 0;
> @@ -311,6 +358,7 @@ static void *tce_iommu_open(unsigned long arg)
>  
>  	mutex_init(&container->lock);
>  	INIT_LIST_HEAD_RCU(&container->group_list);
> +	INIT_LIST_HEAD_RCU(&container->prereg_list);
>  
>  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> @@ -353,6 +401,14 @@ static void tce_iommu_release(void *iommu_data)
>  		tce_iommu_free_table(container, tbl);
>  	}
>  
> +	while (!list_empty(&container->prereg_list)) {
> +		struct tce_iommu_prereg *tcemem;
> +
> +		tcemem = list_first_entry(&container->prereg_list,
> +				struct tce_iommu_prereg, next);
> +		tce_iommu_prereg_free(container, tcemem);
> +	}
> +
>  	tce_iommu_disable(container);
>  	mmdrop(container->mm);
>  	mutex_destroy(&container->lock);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown
  2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2016-10-24  6:53 ` [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown Alexey Kardashevskiy
@ 2016-11-08  7:54 ` Michael Ellerman
  2016-11-08 23:06   ` Alex Williamson
  4 siblings, 1 reply; 21+ messages in thread
From: Michael Ellerman @ 2016-11-08  7:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras,
	Nicholas Piggin, David Gibson

Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> These patches fix a bug where pages stay pinned for hours
> after the QEMU process that requested the pinning has exited.
>
> Please comment. Thanks.
>
> Alexey Kardashevskiy (4):
>   powerpc/iommu: Pass mm_struct to init/cleanup helpers
>   powerpc/iommu: Stop using @current in mm_iommu_xxx
>   vfio/spapr: Reference mm in tce_container
>   powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
>
>  arch/powerpc/include/asm/mmu_context.h |  20 ++--
>  arch/powerpc/kernel/setup-common.c     |   2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
>  arch/powerpc/mm/mmu_context_iommu.c    |  60 ++++-------
>  drivers/vfio/vfio_iommu_spapr_tce.c    | 181 ++++++++++++++++++++++-----------
>  5 files changed, 154 insertions(+), 115 deletions(-)


Alex, given the diffstat how do you want to merge this? (for 4.10)

Should I merge it, you merge it, or I can put it in a topic branch?

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container
  2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
  2016-10-24  7:14   ` Nicholas Piggin
  2016-11-08  3:33   ` David Gibson
@ 2016-11-08 22:25   ` Alex Williamson
  2016-11-09  0:46     ` David Gibson
  2 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2016-11-08 22:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, Nicholas Piggin, Paul Mackerras

On Mon, 24 Oct 2016 17:53:09 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself, so to do proper memory context cleanup
> we had better have tce_container take a reference to the mm_struct and
> use it later when the process is gone (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from making changes to a memory
> context of other processes; in order to add this check,
> VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> quite special anyway - it is the only ioctl() called when neither
> container nor container->mm is initialized.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added check for container->mm!=current->mm in tce_iommu_ioctl()
> for all ioctls and removed other redundant checks
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 131 ++++++++++++++++++------------------
>  1 file changed, 66 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d0c38b2..81ab93f 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -31,49 +31,46 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(long npages)
> +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  {
>  	long ret = 0, locked, lock_limit;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if (!npages)
>  		return 0;
>  
> -	down_write(&current->mm->mmap_sem);
> -	locked = current->mm->locked_vm + npages;
> +	down_write(&mm->mmap_sem);
> +	locked = mm->locked_vm + npages;
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>  		ret = -ENOMEM;
>  	else
> -		current->mm->locked_vm += npages;
> +		mm->locked_vm += npages;
>  
>  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>  			npages << PAGE_SHIFT,
> -			current->mm->locked_vm << PAGE_SHIFT,
> +			mm->locked_vm << PAGE_SHIFT,
>  			rlimit(RLIMIT_MEMLOCK),
>  			ret ? " - exceeded" : "");
>  
> -	up_write(&current->mm->mmap_sem);
> +	up_write(&mm->mmap_sem);
>  
>  	return ret;
>  }
>  
> -static void decrement_locked_vm(long npages)
> +static void decrement_locked_vm(struct mm_struct *mm, long npages)
>  {
> -	if (!current || !current->mm || !npages)
> -		return; /* process exited */
> +	if (!npages)
> +		return;
>  
> -	down_write(&current->mm->mmap_sem);
> -	if (WARN_ON_ONCE(npages > current->mm->locked_vm))
> -		npages = current->mm->locked_vm;
> -	current->mm->locked_vm -= npages;
> +	down_write(&mm->mmap_sem);
> +	if (WARN_ON_ONCE(npages > mm->locked_vm))
> +		npages = mm->locked_vm;
> +	mm->locked_vm -= npages;
>  	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
>  			npages << PAGE_SHIFT,
> -			current->mm->locked_vm << PAGE_SHIFT,
> +			mm->locked_vm << PAGE_SHIFT,
>  			rlimit(RLIMIT_MEMLOCK));
> -	up_write(&current->mm->mmap_sem);
> +	up_write(&mm->mmap_sem);
>  }
>  
>  /*
> @@ -98,6 +95,7 @@ struct tce_container {
>  	bool enabled;
>  	bool v2;
>  	unsigned long locked_pages;
> +	struct mm_struct *mm;
>  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>  	struct list_head group_list;
>  };
> @@ -107,17 +105,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>  		return -EINVAL;
>  
> -	mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
> +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>  	if (!mem)
>  		return -ENOENT;
>  
> -	return mm_iommu_put(current->mm, mem);
> +	return mm_iommu_put(container->mm, mem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -127,14 +122,11 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  	struct mm_iommu_table_group_mem_t *mem = NULL;
>  	unsigned long entries = size >> PAGE_SHIFT;
>  
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
>  			((vaddr + size) < vaddr))
>  		return -EINVAL;
>  
> -	ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
> +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>  	if (ret)
>  		return ret;
>  
> @@ -143,7 +135,8 @@ static long tce_iommu_register_pages(struct tce_container *container,
>  	return 0;
>  }
>  
> -static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
> +static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
> +		struct mm_struct *mm)
>  {
>  	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
>  			tbl->it_size, PAGE_SIZE);
> @@ -152,13 +145,13 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
>  
>  	BUG_ON(tbl->it_userspace);
>  
> -	ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
> +	ret = try_increment_locked_vm(mm, cb >> PAGE_SHIFT);
>  	if (ret)
>  		return ret;
>  
>  	uas = vzalloc(cb);
>  	if (!uas) {
> -		decrement_locked_vm(cb >> PAGE_SHIFT);
> +		decrement_locked_vm(mm, cb >> PAGE_SHIFT);
>  		return -ENOMEM;
>  	}
>  	tbl->it_userspace = uas;
> @@ -166,7 +159,8 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
>  	return 0;
>  }
>  
> -static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
> +static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
> +		struct mm_struct *mm)
>  {
>  	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
>  			tbl->it_size, PAGE_SIZE);
> @@ -176,7 +170,7 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
>  
>  	vfree(tbl->it_userspace);
>  	tbl->it_userspace = NULL;
> -	decrement_locked_vm(cb >> PAGE_SHIFT);
> +	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
>  }
>  
>  static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> @@ -236,9 +230,6 @@ static int tce_iommu_enable(struct tce_container *container)
>  	struct iommu_table_group *table_group;
>  	struct tce_iommu_group *tcegrp;
>  
> -	if (!current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if (container->enabled)
>  		return -EBUSY;
>  
> @@ -284,7 +275,7 @@ static int tce_iommu_enable(struct tce_container *container)
>  		return -EPERM;
>  
>  	locked = table_group->tce32_size >> PAGE_SHIFT;
> -	ret = try_increment_locked_vm(locked);
> +	ret = try_increment_locked_vm(container->mm, locked);
>  	if (ret)
>  		return ret;
>  
> @@ -302,10 +293,7 @@ static void tce_iommu_disable(struct tce_container *container)
>  
>  	container->enabled = false;
>  
> -	if (!current->mm)
> -		return;
> -
> -	decrement_locked_vm(container->locked_pages);
> +	decrement_locked_vm(container->mm, container->locked_pages);
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)
> @@ -326,13 +314,18 @@ static void *tce_iommu_open(unsigned long arg)
>  
>  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> +	/* current->mm cannot be NULL in this context */
> +	container->mm = current->mm;
> +	atomic_inc(&container->mm->mm_count);

Are you sure you wouldn't rather do this on the actual
preregistration?  The advice I gave to Kirti for mdev was that it's
currently possible to have a configuration where a privileged user
opens a container, adds a group, sets the iommu, and then passes the
file descriptors to another user.  We can then set an mm once mapping
or preregistration occurs, and from there require that the unmapping mm
matches the mapping mm, or that both of those match the preregistration
mm.  I'm not sure I see any reason to require that the current->mm
performing this operation is the only one that's allowed to perform
those later tasks.
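
Something like this is what I have in mind; an untested sketch (the
helper name is invented) just to show the shape of deferring the
reference until first use:

	static long tce_iommu_mm_set(struct tce_container *container)
	{
		if (container->mm) {
			if (container->mm == current->mm)
				return 0;
			return -EPERM;
		}
		BUG_ON(!current->mm);
		container->mm = current->mm;
		atomic_inc(&container->mm->mm_count);

		return 0;
	}

i.e. each ioctl path that actually needs an mm calls this first, the
first caller wins, and everyone after that has to match it.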

> +
>  	return container;
>  }
>  
>  static int tce_iommu_clear(struct tce_container *container,
>  		struct iommu_table *tbl,
>  		unsigned long entry, unsigned long pages);
> -static void tce_iommu_free_table(struct iommu_table *tbl);
> +static void tce_iommu_free_table(struct tce_container *container,
> +		struct iommu_table *tbl);
>  
>  static void tce_iommu_release(void *iommu_data)
>  {
> @@ -357,10 +350,11 @@ static void tce_iommu_release(void *iommu_data)
>  			continue;
>  
>  		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> -		tce_iommu_free_table(tbl);
> +		tce_iommu_free_table(container, tbl);
>  	}
>  
>  	tce_iommu_disable(container);
> +	mmdrop(container->mm);

I imagine you'd still do this here, just:

	if (container->mm)
		mmdrop(container->mm);

Of course you'd need to check how many places would need similar
tests; maybe some of the current->mm tests would be converted to
container->mm rather than dropped.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown
  2016-11-08  7:54 ` [PATCH kernel v4 0/4] powerpc/spapr/vfio: " Michael Ellerman
@ 2016-11-08 23:06   ` Alex Williamson
  2016-11-10  1:37     ` Michael Ellerman
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2016-11-08 23:06 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras,
	Nicholas Piggin, David Gibson

On Tue, 08 Nov 2016 18:54:28 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> > These patches fix a bug where pages stay pinned for hours
> > after the QEMU process that requested the pinning has exited.
> >
> > Please comment. Thanks.
> >
> > Alexey Kardashevskiy (4):
> >   powerpc/iommu: Pass mm_struct to init/cleanup helpers
> >   powerpc/iommu: Stop using @current in mm_iommu_xxx
> >   vfio/spapr: Reference mm in tce_container
> >   powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
> >
> >  arch/powerpc/include/asm/mmu_context.h |  20 ++--
> >  arch/powerpc/kernel/setup-common.c     |   2 +-
> >  arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
> >  arch/powerpc/mm/mmu_context_iommu.c    |  60 ++++-------
> >  drivers/vfio/vfio_iommu_spapr_tce.c    | 181 ++++++++++++++++++++++-----------
> >  5 files changed, 154 insertions(+), 115 deletions(-)  
> 
> 
> Alex, given the diffstat how do you want to merge this? (for 4.10)
> 
> Should I merge it, you merge it, or I can put it in a topic branch?

I have an outstanding question on one patch, but otherwise I'd prefer
you merge it, I'm not expecting to have anything that would conflict
with vfio_iommu_spapr_tce.c and this code is tied pretty tightly with
arch code.  I'll provide acks depending on how we resolve my question.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container
  2016-11-08 22:25   ` Alex Williamson
@ 2016-11-09  0:46     ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2016-11-09  0:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Nicholas Piggin, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 10753 bytes --]

On Tue, Nov 08, 2016 at 03:25:19PM -0700, Alex Williamson wrote:
> On Mon, 24 Oct 2016 17:53:09 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > In some situations the userspace memory context may live longer than
> > the userspace process itself, so to do proper memory context cleanup
> > we had better have tce_container take a reference to the mm_struct and
> > use it later when the process is gone (@current or @current->mm is NULL).
> > 
> > This references mm and stores the pointer in the container; this is done
> > when a container is just created so checking for !current->mm in other
> > places becomes pointless.
> > 
> > This replaces current->mm with container->mm everywhere except debug
> > prints.
> > 
> > This adds a check that current->mm is the same as the one stored in
> > the container to prevent userspace from making changes to a memory
> > context of other processes; in order to add this check,
> > VFIO_CHECK_EXTENSION is moved out of the switch(cmd) as it is
> > quite special anyway - it is the only ioctl() called when neither
> > container nor container->mm is initialized.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > Changes:
> > v4:
> > * added check for container->mm!=current->mm in tce_iommu_ioctl()
> > for all ioctls and removed other redundant checks
> > ---
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 131 ++++++++++++++++++------------------
> >  1 file changed, 66 insertions(+), 65 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index d0c38b2..81ab93f 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -31,49 +31,46 @@
> >  static void tce_iommu_detach_group(void *iommu_data,
> >  		struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(long npages)
> > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> >  {
> >  	long ret = 0, locked, lock_limit;
> >  
> > -	if (!current || !current->mm)
> > -		return -ESRCH; /* process exited */
> > -
> >  	if (!npages)
> >  		return 0;
> >  
> > -	down_write(&current->mm->mmap_sem);
> > -	locked = current->mm->locked_vm + npages;
> > +	down_write(&mm->mmap_sem);
> > +	locked = mm->locked_vm + npages;
> >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> >  		ret = -ENOMEM;
> >  	else
> > -		current->mm->locked_vm += npages;
> > +		mm->locked_vm += npages;
> >  
> >  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> >  			npages << PAGE_SHIFT,
> > -			current->mm->locked_vm << PAGE_SHIFT,
> > +			mm->locked_vm << PAGE_SHIFT,
> >  			rlimit(RLIMIT_MEMLOCK),
> >  			ret ? " - exceeded" : "");
> >  
> > -	up_write(&current->mm->mmap_sem);
> > +	up_write(&mm->mmap_sem);
> >  
> >  	return ret;
> >  }
> >  
> > -static void decrement_locked_vm(long npages)
> > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> >  {
> > -	if (!current || !current->mm || !npages)
> > -		return; /* process exited */
> > +	if (!npages)
> > +		return;
> >  
> > -	down_write(&current->mm->mmap_sem);
> > -	if (WARN_ON_ONCE(npages > current->mm->locked_vm))
> > -		npages = current->mm->locked_vm;
> > -	current->mm->locked_vm -= npages;
> > +	down_write(&mm->mmap_sem);
> > +	if (WARN_ON_ONCE(npages > mm->locked_vm))
> > +		npages = mm->locked_vm;
> > +	mm->locked_vm -= npages;
> >  	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> >  			npages << PAGE_SHIFT,
> > -			current->mm->locked_vm << PAGE_SHIFT,
> > +			mm->locked_vm << PAGE_SHIFT,
> >  			rlimit(RLIMIT_MEMLOCK));
> > -	up_write(&current->mm->mmap_sem);
> > +	up_write(&mm->mmap_sem);
> >  }
> >  
> >  /*
> > @@ -98,6 +95,7 @@ struct tce_container {
> >  	bool enabled;
> >  	bool v2;
> >  	unsigned long locked_pages;
> > +	struct mm_struct *mm;
> >  	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> >  	struct list_head group_list;
> >  };
> > @@ -107,17 +105,14 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
> >  {
> >  	struct mm_iommu_table_group_mem_t *mem;
> >  
> > -	if (!current || !current->mm)
> > -		return -ESRCH; /* process exited */
> > -
> >  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
> >  		return -EINVAL;
> >  
> > -	mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
> > +	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
> >  	if (!mem)
> >  		return -ENOENT;
> >  
> > -	return mm_iommu_put(current->mm, mem);
> > +	return mm_iommu_put(container->mm, mem);
> >  }
> >  
> >  static long tce_iommu_register_pages(struct tce_container *container,
> > @@ -127,14 +122,11 @@ static long tce_iommu_register_pages(struct tce_container *container,
> >  	struct mm_iommu_table_group_mem_t *mem = NULL;
> >  	unsigned long entries = size >> PAGE_SHIFT;
> >  
> > -	if (!current || !current->mm)
> > -		return -ESRCH; /* process exited */
> > -
> >  	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK) ||
> >  			((vaddr + size) < vaddr))
> >  		return -EINVAL;
> >  
> > -	ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
> > +	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -143,7 +135,8 @@ static long tce_iommu_register_pages(struct tce_container *container,
> >  	return 0;
> >  }
> >  
> > -static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
> > +static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
> > +		struct mm_struct *mm)
> >  {
> >  	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
> >  			tbl->it_size, PAGE_SIZE);
> > @@ -152,13 +145,13 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
> >  
> >  	BUG_ON(tbl->it_userspace);
> >  
> > -	ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
> > +	ret = try_increment_locked_vm(mm, cb >> PAGE_SHIFT);
> >  	if (ret)
> >  		return ret;
> >  
> >  	uas = vzalloc(cb);
> >  	if (!uas) {
> > -		decrement_locked_vm(cb >> PAGE_SHIFT);
> > +		decrement_locked_vm(mm, cb >> PAGE_SHIFT);
> >  		return -ENOMEM;
> >  	}
> >  	tbl->it_userspace = uas;
> > @@ -166,7 +159,8 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
> >  	return 0;
> >  }
> >  
> > -static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
> > +static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
> > +		struct mm_struct *mm)
> >  {
> >  	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
> >  			tbl->it_size, PAGE_SIZE);
> > @@ -176,7 +170,7 @@ static void tce_iommu_userspace_view_free(struct iommu_table *tbl)
> >  
> >  	vfree(tbl->it_userspace);
> >  	tbl->it_userspace = NULL;
> > -	decrement_locked_vm(cb >> PAGE_SHIFT);
> > +	decrement_locked_vm(mm, cb >> PAGE_SHIFT);
> >  }
> >  
> >  static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> > @@ -236,9 +230,6 @@ static int tce_iommu_enable(struct tce_container *container)
> >  	struct iommu_table_group *table_group;
> >  	struct tce_iommu_group *tcegrp;
> >  
> > -	if (!current->mm)
> > -		return -ESRCH; /* process exited */
> > -
> >  	if (container->enabled)
> >  		return -EBUSY;
> >  
> > @@ -284,7 +275,7 @@ static int tce_iommu_enable(struct tce_container *container)
> >  		return -EPERM;
> >  
> >  	locked = table_group->tce32_size >> PAGE_SHIFT;
> > -	ret = try_increment_locked_vm(locked);
> > +	ret = try_increment_locked_vm(container->mm, locked);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -302,10 +293,7 @@ static void tce_iommu_disable(struct tce_container *container)
> >  
> >  	container->enabled = false;
> >  
> > -	if (!current->mm)
> > -		return;
> > -
> > -	decrement_locked_vm(container->locked_pages);
> > +	decrement_locked_vm(container->mm, container->locked_pages);
> >  }
> >  
> >  static void *tce_iommu_open(unsigned long arg)
> > @@ -326,13 +314,18 @@ static void *tce_iommu_open(unsigned long arg)
> >  
> >  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >  
> > +	/* current->mm cannot be NULL in this context */
> > +	container->mm = current->mm;
> > +	atomic_inc(&container->mm->mm_count);
> 
> Are you sure you wouldn't rather do this on the actual
> preregistration?  The advice I gave to Kirti for mdev was that it's
> currently possible to have a configuration where a privileged user
> opens a container, adds a group, sets the iommu, and then passes the
> file descriptors to another user.  We can then set an mm once mapping
> or preregistration occurs, and from there require that the unmapping mm
> matches the mapping mm, or that both of those match the preregistration
> mm.  I'm not sure I see any reason to require that the current->mm
> performing this operation is the only one that's allowed to perform
> those later tasks.
> 
> > +
> >  	return container;
> >  }
> >  
> >  static int tce_iommu_clear(struct tce_container *container,
> >  		struct iommu_table *tbl,
> >  		unsigned long entry, unsigned long pages);
> > -static void tce_iommu_free_table(struct iommu_table *tbl);
> > +static void tce_iommu_free_table(struct tce_container *container,
> > +		struct iommu_table *tbl);
> >  
> >  static void tce_iommu_release(void *iommu_data)
> >  {
> > @@ -357,10 +350,11 @@ static void tce_iommu_release(void *iommu_data)
> >  			continue;
> >  
> >  		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> > -		tce_iommu_free_table(tbl);
> > +		tce_iommu_free_table(container, tbl);
> >  	}
> >  
> >  	tce_iommu_disable(container);
> > +	mmdrop(container->mm);
> 
> I imagine you'd still do this here, just:
> 
> 	if (container->mm)
> 		mmdrop(container->mm);
> 
> Of course you'd need to check how many places would need similar
> tests; maybe some of the current->mm tests would be converted to
> container->mm rather than dropped.  Thanks,

Right, as you've implied here, the advantage of locking the mm on open
is simplicity - open always happens, and it always happens before
anything else, so there are no conditionals about whether we have or
haven't set the mm yet.

That said, you have just described a plausible use case where it's
useful to have one mm open the container, then another one use it, so
it might be worth adding that ability.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown
  2016-11-08 23:06   ` Alex Williamson
@ 2016-11-10  1:37     ` Michael Ellerman
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2016-11-10  1:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, linuxppc-dev, Paul Mackerras,
	Nicholas Piggin, David Gibson

Alex Williamson <alex.williamson@redhat.com> writes:
> On Tue, 08 Nov 2016 18:54:28 +1100
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>> >  arch/powerpc/include/asm/mmu_context.h |  20 ++--
>> >  arch/powerpc/kernel/setup-common.c     |   2 +-
>> >  arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
>> >  arch/powerpc/mm/mmu_context_iommu.c    |  60 ++++-------
>> >  drivers/vfio/vfio_iommu_spapr_tce.c    | 181 ++++++++++++++++++++++-----------
>> >  5 files changed, 154 insertions(+), 115 deletions(-)  
>> 
>> Alex, given the diffstat how do you want to merge this? (for 4.10)
>> 
>> Should I merge it, you merge it, or I can put it in a topic branch?
>
> I have an outstanding question on one patch, but otherwise I'd prefer
> you merge it, I'm not expecting to have anything that would conflict
> with vfio_iommu_spapr_tce.c and this code is tied pretty tightly with
> arch code.  I'll provide acks depending on how we resolve my question.

Great, thanks.

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-11-10  1:37 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24  6:53 [PATCH kernel v4 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown Alexey Kardashevskiy
2016-10-24  6:53 ` [PATCH kernel v4 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers Alexey Kardashevskiy
2016-10-24  6:53 ` [PATCH kernel v4 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx Alexey Kardashevskiy
2016-10-24  6:53 ` [PATCH kernel v4 3/4] vfio/spapr: Reference mm in tce_container Alexey Kardashevskiy
2016-10-24  7:14   ` Nicholas Piggin
2016-11-08  3:33   ` David Gibson
2016-11-08 22:25   ` Alex Williamson
2016-11-09  0:46     ` David Gibson
2016-10-24  6:53 ` [PATCH kernel v4 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown Alexey Kardashevskiy
2016-10-25  4:44   ` David Gibson
2016-10-25  4:55     ` Alexey Kardashevskiy
2016-10-31  3:13       ` David Gibson
2016-10-31  4:13         ` Alexey Kardashevskiy
2016-10-31  4:23           ` David Gibson
2016-11-02  2:44             ` Alexey Kardashevskiy
2016-11-03  1:02               ` Paul Mackerras
2016-11-08  3:33                 ` David Gibson
2016-11-08  3:35   ` David Gibson
2016-11-08  7:54 ` [PATCH kernel v4 0/4] powerpc/spapr/vfio: " Michael Ellerman
2016-11-08 23:06   ` Alex Williamson
2016-11-10  1:37     ` Michael Ellerman
