* [PATCH kernel v2 0/2] powerpc/mm_iommu: Fixes
From: Alexey Kardashevskiy @ 2019-04-03  4:12 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Aneesh Kumar K.V, kvm-ppc, David Gibson

The patches do independent things but touch exactly the same code, so
the order in which they are applied matters.

This supersedes:
[PATCH kernel] powerpc/mm_iommu: Allow pinning large regions
[PATCH kernel 1/2] powerpc/mm_iommu: Prepare for less locking
[PATCH kernel 2/2] powerpc/mm_iommu: Fix potential deadlock


This is based on sha1
5e7a8ca31926 Linus Torvalds "Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  powerpc/mm_iommu: Fix potential deadlock
  powerpc/mm_iommu: Allow pinning large regions

 arch/powerpc/mm/mmu_context_iommu.c | 97 +++++++++++++++++------------
 1 file changed, 58 insertions(+), 39 deletions(-)

-- 
2.17.1



* [PATCH kernel v2 1/2] powerpc/mm_iommu: Fix potential deadlock
From: Alexey Kardashevskiy @ 2019-04-03  4:12 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Aneesh Kumar K.V, kvm-ppc, David Gibson

Currently mm_iommu_do_alloc() is called in two cases:
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl() for normal memory:
	this locks &mem_list_mutex and then locks mm::mmap_sem
	several times when adjusting locked_vm or pinning pages;
- vfio_pci_nvgpu_regops::mmap() for GPU memory:
	this is called with mm::mmap_sem held already and it locks
	&mem_list_mutex.

So one can craft a userspace program that issues the ioctl and the mmap
concurrently in two threads and causes a deadlock, which lockdep warns
about (below).

We have not hit this yet because QEMU constructs the machine in a single
thread.
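The inversion can be reproduced in miniature with two plain pthread
mutexes. This is only an illustrative model, not the kernel API: the
mutexes stand in for mm::mmap_sem and mem_list_mutex, and trylock is
used so the demo observes the inverted ordering instead of hanging.

```c
#include <stddef.h>
#include <pthread.h>

/* Stand-ins for mm->mmap_sem and mem_list_mutex (illustrative only). */
static pthread_mutex_t mmap_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mem_list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Models the ioctl path: mem_list_mutex first, then mmap_sem.
 * Sets *ret to 1 if mmap_sem was unavailable, i.e. where a real
 * mutex_lock() would block against the opposite ordering. */
static void *ioctl_path(void *ret)
{
	pthread_mutex_lock(&mem_list_mutex);
	if (pthread_mutex_trylock(&mmap_sem) == 0) {
		pthread_mutex_unlock(&mmap_sem);
		*(int *)ret = 0;
	} else {
		*(int *)ret = 1;	/* would have blocked: deadlock risk */
	}
	pthread_mutex_unlock(&mem_list_mutex);
	return NULL;
}

/* Models the mmap path mid-flight: mmap_sem is already held (as in
 * vm_mmap_pgoff()) while another thread runs the ioctl path. */
int demo_inversion(void)
{
	pthread_t t;
	int blocked = 0;

	pthread_mutex_lock(&mmap_sem);
	pthread_create(&t, NULL, ioctl_path, &blocked);
	pthread_join(t, NULL);
	pthread_mutex_unlock(&mmap_sem);
	return blocked;
}
```

With real blocking locks the ioctl thread would sleep holding
mem_list_mutex while the mmap thread, holding mmap_sem, waited for
mem_list_mutex: the classic AB-BA cycle lockdep reports below.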

This moves the overlap check next to where the new entry is added and
reduces the amount of time spent with &mem_list_mutex held.

This moves the locked_vm adjustment out from under &mem_list_mutex.

This relies on mm_iommu_adjust_locked_vm() doing nothing when entries==0.
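The overlap test being moved is a plain interval check over userspace
address ranges. A standalone sketch of the same condition (the
PAGE_SHIFT value here is an illustrative 64K, a common powerpc64
configuration, not taken from any particular kernel config):

```c
#define PAGE_SHIFT 16	/* illustrative: 64K pages */

/* Two regions of @entries pages starting at userspace address @ua
 * overlap iff each one starts strictly below the other's end --
 * the same condition mm_iommu_do_alloc() applies under
 * mem_list_mutex before publishing the new entry. */
int regions_overlap(unsigned long ua1, unsigned long entries1,
		    unsigned long ua2, unsigned long entries2)
{
	return (ua1 < ua2 + (entries2 << PAGE_SHIFT)) &&
	       (ua2 < ua1 + (entries1 << PAGE_SHIFT));
}
```

Because the ranges are half-open, regions that merely touch
(one ending exactly where the other begins) do not count as
overlapping.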

This is one of the lockdep warnings:
======================================================
WARNING: possible circular locking dependency detected
5.1.0-rc2-le_nv2_aikATfstn1-p1 #363 Not tainted
------------------------------------------------------
qemu-system-ppc/8038 is trying to acquire lock:
000000002ec6c453 (mem_list_mutex){+.+.}, at: mm_iommu_do_alloc+0x70/0x490

but task is already holding lock:
00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++}:
       lock_acquire+0xf8/0x260
       down_write+0x44/0xa0
       mm_iommu_adjust_locked_vm.part.1+0x4c/0x190
       mm_iommu_do_alloc+0x310/0x490
       tce_iommu_ioctl.part.9+0xb84/0x1150 [vfio_iommu_spapr_tce]
       vfio_fops_unl_ioctl+0x94/0x430 [vfio]
       do_vfs_ioctl+0xe4/0x930
       ksys_ioctl+0xc4/0x110
       sys_ioctl+0x28/0x80
       system_call+0x5c/0x70

-> #0 (mem_list_mutex){+.+.}:
       __lock_acquire+0x1484/0x1900
       lock_acquire+0xf8/0x260
       __mutex_lock+0x88/0xa70
       mm_iommu_do_alloc+0x70/0x490
       vfio_pci_nvgpu_mmap+0xc0/0x130 [vfio_pci]
       vfio_pci_mmap+0x198/0x2a0 [vfio_pci]
       vfio_device_fops_mmap+0x44/0x70 [vfio]
       mmap_region+0x5d4/0x770
       do_mmap+0x42c/0x650
       vm_mmap_pgoff+0x124/0x160
       ksys_mmap_pgoff+0xdc/0x2f0
       sys_mmap+0x40/0x80
       system_call+0x5c/0x70

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(mem_list_mutex);
                               lock(&mm->mmap_sem);
  lock(mem_list_mutex);

 *** DEADLOCK ***

1 lock held by qemu-system-ppc/8038:
 #0: 00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160

Fixes: c10c21efa4bc ("powerpc/vfio/iommu/kvm: Do not pin device memory", 2018-12-19)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/mm/mmu_context_iommu.c | 75 +++++++++++++++--------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e7a9c4f6bfca..9d9be850f8c2 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -95,28 +95,14 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 			      unsigned long entries, unsigned long dev_hpa,
 			      struct mm_iommu_table_group_mem_t **pmem)
 {
-	struct mm_iommu_table_group_mem_t *mem;
-	long i, ret, locked_entries = 0;
+	struct mm_iommu_table_group_mem_t *mem, *mem2;
+	long i, ret, locked_entries = 0, pinned = 0;
 	unsigned int pageshift;
 
-	mutex_lock(&mem_list_mutex);
-
-	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
-			next) {
-		/* Overlap? */
-		if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
-				(ua < (mem->ua +
-				       (mem->entries << PAGE_SHIFT)))) {
-			ret = -EINVAL;
-			goto unlock_exit;
-		}
-
-	}
-
 	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
 		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
 		if (ret)
-			goto unlock_exit;
+			return ret;
 
 		locked_entries = entries;
 	}
@@ -150,15 +136,10 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	down_read(&mm->mmap_sem);
 	ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, mem->hpages, NULL);
 	up_read(&mm->mmap_sem);
+	pinned = ret > 0 ? ret : 0;
 	if (ret != entries) {
-		/* free the reference taken */
-		for (i = 0; i < ret; i++)
-			put_page(mem->hpages[i]);
-
-		vfree(mem->hpas);
-		kfree(mem);
 		ret = -EFAULT;
-		goto unlock_exit;
+		goto free_exit;
 	}
 
 	pageshift = PAGE_SHIFT;
@@ -183,21 +164,43 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	}
 
 good_exit:
-	ret = 0;
 	atomic64_set(&mem->mapped, 1);
 	mem->used = 1;
 	mem->ua = ua;
 	mem->entries = entries;
-	*pmem = mem;
+
+	mutex_lock(&mem_list_mutex);
+
+	list_for_each_entry_rcu(mem2, &mm->context.iommu_group_mem_list, next) {
+		/* Overlap? */
+		if ((mem2->ua < (ua + (entries << PAGE_SHIFT))) &&
+				(ua < (mem2->ua +
+				       (mem2->entries << PAGE_SHIFT)))) {
+			ret = -EINVAL;
+			mutex_unlock(&mem_list_mutex);
+			goto free_exit;
+		}
+	}
 
 	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
-unlock_exit:
-	if (locked_entries && ret)
-		mm_iommu_adjust_locked_vm(mm, locked_entries, false);
-
 	mutex_unlock(&mem_list_mutex);
 
+	*pmem = mem;
+
+	return 0;
+
+free_exit:
+	/* free the reference taken */
+	for (i = 0; i < pinned; i++)
+		put_page(mem->hpages[i]);
+
+	vfree(mem->hpas);
+	kfree(mem);
+
+unlock_exit:
+	mm_iommu_adjust_locked_vm(mm, locked_entries, false);
+
 	return ret;
 }
 
@@ -266,7 +269,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
 long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 {
 	long ret = 0;
-	unsigned long entries, dev_hpa;
+	unsigned long unlock_entries = 0;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -287,17 +290,17 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
 		goto unlock_exit;
 	}
 
+	if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
+		unlock_entries = mem->entries;
+
 	/* @mapped became 0 so now mappings are disabled, release the region */
-	entries = mem->entries;
-	dev_hpa = mem->dev_hpa;
 	mm_iommu_release(mem);
 
-	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
-		mm_iommu_adjust_locked_vm(mm, entries, false);
-
 unlock_exit:
 	mutex_unlock(&mem_list_mutex);
 
+	mm_iommu_adjust_locked_vm(mm, unlock_entries, false);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mm_iommu_put);
-- 
2.17.1



* [PATCH kernel v2 2/2] powerpc/mm_iommu: Allow pinning large regions
From: Alexey Kardashevskiy @ 2019-04-03  4:12 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Aneesh Kumar K.V, kvm-ppc, David Gibson

When called with vmas_arg==NULL, get_user_pages_longterm() allocates
an array of nr_pages*8 bytes, which can easily exceed the max order;
for example, registering memory for a 256GB guest does this and fails
in __alloc_pages_nodemask().

This adds a loop over chunks of entries to fit the max order limit.
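The chunking arithmetic can be sketched outside the kernel. The
PAGE_SHIFT and MAX_ORDER values below are illustrative stand-ins
(64K pages with the default buddy-allocator limit), not taken from a
particular config; the chunk formula is the one the patch uses.

```c
#define PAGE_SHIFT 16	/* illustrative: 64K pages */
#define MAX_ORDER  11	/* illustrative: default buddy allocator limit */

/* Largest number of page pointers whose array still fits in a
 * single MAX_ORDER-1 allocation of (1 << (PAGE_SHIFT + MAX_ORDER - 1))
 * bytes, as computed in the patch. */
unsigned long max_chunk(void)
{
	return (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) / sizeof(void *);
}

/* Number of get_user_pages_longterm() calls needed to pin @entries
 * pages when pinning proceeds in chunks of at most max_chunk(). */
unsigned long pin_calls(unsigned long entries)
{
	unsigned long chunk = max_chunk();
	unsigned long entry, calls = 0;

	if (chunk > entries)
		chunk = entries;
	for (entry = 0; entry < entries; entry += chunk)
		calls++;
	return calls;
}
```

With these illustrative values the chunk is 8388608 entries, so a
region only needs multiple calls once its pointer array alone would
exceed a 64MB allocation; each chunk's internal allocation then stays
within the max-order limit.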

Fixes: 678e174c4c16 ("powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc", 2019-03-05)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/mm/mmu_context_iommu.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 9d9be850f8c2..8330f135294f 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -98,6 +98,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	struct mm_iommu_table_group_mem_t *mem, *mem2;
 	long i, ret, locked_entries = 0, pinned = 0;
 	unsigned int pageshift;
+	unsigned long entry, chunk;
 
 	if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
 		ret = mm_iommu_adjust_locked_vm(mm, entries, true);
@@ -134,11 +135,26 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	}
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, mem->hpages, NULL);
+	chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) /
+			sizeof(struct vm_area_struct *);
+	chunk = min(chunk, entries);
+	for (entry = 0; entry < entries; entry += chunk) {
+		unsigned long n = min(entries - entry, chunk);
+
+		ret = get_user_pages_longterm(ua + (entry << PAGE_SHIFT), n,
+				FOLL_WRITE, mem->hpages + entry, NULL);
+		if (ret == n) {
+			pinned += n;
+			continue;
+		}
+		if (ret > 0)
+			pinned += ret;
+		break;
+	}
 	up_read(&mm->mmap_sem);
-	pinned = ret > 0 ? ret : 0;
-	if (ret != entries) {
-		ret = -EFAULT;
+	if (pinned != entries) {
+		if (!ret)
+			ret = -EFAULT;
 		goto free_exit;
 	}
 
-- 
2.17.1



* Re: [kernel,v2,2/2] powerpc/mm_iommu: Allow pinning large regions
From: Michael Ellerman @ 2019-04-21 14:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Aneesh Kumar K.V, kvm-ppc, David Gibson

On Wed, 2019-04-03 at 04:12:33 UTC, Alexey Kardashevskiy wrote:
> When called with vmas_arg==NULL, get_user_pages_longterm() allocates
> an array of nr_pages*8 which can easily get greater than the max order,
> for example, registering memory for a 256GB guest does this and fails
> in __alloc_pages_nodemask().
> 
> This adds a loop over chunks of entries to fit the max order limit.
> 
> Fixes: 678e174c4c16 ("powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc", 2019-03-05)
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7a3a4d763837d3aa654cd10590309504

cheers


* Re: [kernel,v2,1/2] powerpc/mm_iommu: Fix potential deadlock
From: Michael Ellerman @ 2019-04-21 14:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Alexey Kardashevskiy, Aneesh Kumar K.V, kvm-ppc, David Gibson

On Wed, 2019-04-03 at 04:12:32 UTC, Alexey Kardashevskiy wrote:
> Currently mm_iommu_do_alloc() is called in 2 cases:
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl() for normal memory:
> 	this locks &mem_list_mutex and then locks mm::mmap_sem
> 	several times when adjusting locked_vm or pinning pages;
> - vfio_pci_nvgpu_regops::mmap() for GPU memory:
> 	this is called with mm::mmap_sem held already and it locks
> 	&mem_list_mutex.
> 
> So one can craft a userspace program to do special ioctl and mmap in
> 2 threads concurrently and cause a deadlock which lockdep warns about
> (below).
> 
> We did not hit this yet because QEMU constructs the machine in a single
> thread.
> 
> This moves the overlap check next to where the new entry is added and
> reduces the amount of time spent with &mem_list_mutex held.
> 
> This moves locked_vm adjustment from under &mem_list_mutex.
> 
> This relies on mm_iommu_adjust_locked_vm() doing nothing when entries==0.
> 
> This is one of the lockdep warnings:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.1.0-rc2-le_nv2_aikATfstn1-p1 #363 Not tainted
> ------------------------------------------------------
> qemu-system-ppc/8038 is trying to acquire lock:
> 000000002ec6c453 (mem_list_mutex){+.+.}, at: mm_iommu_do_alloc+0x70/0x490
> 
> but task is already holding lock:
> 00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&mm->mmap_sem){++++}:
>        lock_acquire+0xf8/0x260
>        down_write+0x44/0xa0
>        mm_iommu_adjust_locked_vm.part.1+0x4c/0x190
>        mm_iommu_do_alloc+0x310/0x490
>        tce_iommu_ioctl.part.9+0xb84/0x1150 [vfio_iommu_spapr_tce]
>        vfio_fops_unl_ioctl+0x94/0x430 [vfio]
>        do_vfs_ioctl+0xe4/0x930
>        ksys_ioctl+0xc4/0x110
>        sys_ioctl+0x28/0x80
>        system_call+0x5c/0x70
> 
> -> #0 (mem_list_mutex){+.+.}:
>        __lock_acquire+0x1484/0x1900
>        lock_acquire+0xf8/0x260
>        __mutex_lock+0x88/0xa70
>        mm_iommu_do_alloc+0x70/0x490
>        vfio_pci_nvgpu_mmap+0xc0/0x130 [vfio_pci]
>        vfio_pci_mmap+0x198/0x2a0 [vfio_pci]
>        vfio_device_fops_mmap+0x44/0x70 [vfio]
>        mmap_region+0x5d4/0x770
>        do_mmap+0x42c/0x650
>        vm_mmap_pgoff+0x124/0x160
>        ksys_mmap_pgoff+0xdc/0x2f0
>        sys_mmap+0x40/0x80
>        system_call+0x5c/0x70
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&mm->mmap_sem);
>                                lock(mem_list_mutex);
>                                lock(&mm->mmap_sem);
>   lock(mem_list_mutex);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by qemu-system-ppc/8038:
>  #0: 00000000fd7da97f (&mm->mmap_sem){++++}, at: vm_mmap_pgoff+0xf0/0x160
> 
> Fixes: c10c21efa4bc ("powerpc/vfio/iommu/kvm: Do not pin device memory", 2018-12-19)
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/eb9d7a62c38628ab0ba6e59d22d7cb79

cheers

