linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users
@ 2019-01-21 17:42 Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm, linux-kernel, dave

Changes from v1 (https://patchwork.kernel.org/cover/10764923/):
 - Converted pinned_vm to atomic64 instead of atomic_long such that
   infiniband need not worry about overflows.

 - Rebased patch 1 and added Ira's reviews as well as Parvi's review
   for patch 5 (thanks!).
   
--------

Hi,

The following patches aim to provide cleanups to users that pin pages
(mostly infiniband) by converting the counter to atomic -- note that
Daniel Jordan also has patches[1] for the locked_vm counterpart and vfio.

Apart from removing a source of mmap_sem writer, we benefit in that
we can get rid of a lot of code that defers work when the lock cannot
be acquired, as well as drivers avoiding mmap_sem altogether by also
converting gup to gup_fast() and letting the mm handle it. Users
that do the gup_longterm() remain of course under at least reader mmap_sem.

Everything has been compile-tested _only_ so I hope I didn't do anything
too stupid. Please consider for v5.1.

On a similar topic and potential follow up, it would be nice to resurrect
Peter's VM_PINNED idea in that the broken semantics that occurred after
bc3e53f682 ("mm: distinguish between mlocked and pinned pages") are still
present. Also encapsulating internal mm logic via mm[un]pin() instead of
drivers having to know about internals and playing nice with compaction are
all wins.

Thanks!

[1] https://lkml.org/lkml/2018/11/5/854

Davidlohr Bueso (6):
  mm: make mm->pinned_vm an atomic64 counter
  mic/scif: do not use mmap_sem
  drivers/IB,qib: do not use mmap_sem
  drivers/IB,hfi1: do not se mmap_sem
  drivers/IB,usnic: reduce scope of mmap_sem
  drivers/IB,core: reduce scope of mmap_sem

 drivers/infiniband/core/umem.c              | 47 +++-----------------
 drivers/infiniband/hw/hfi1/user_pages.c     | 12 ++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 69 ++++++++++-------------------
 drivers/infiniband/hw/usnic/usnic_ib_main.c |  2 -
 drivers/infiniband/hw/usnic/usnic_uiom.c    | 56 +++--------------------
 drivers/infiniband/hw/usnic/usnic_uiom.h    |  1 -
 drivers/misc/mic/scif/scif_rma.c            | 38 +++++-----------
 fs/proc/task_mmu.c                          |  2 +-
 include/linux/mm_types.h                    |  2 +-
 kernel/events/core.c                        |  8 ++--
 kernel/fork.c                               |  2 +-
 mm/debug.c                                  |  3 +-
 12 files changed, 57 insertions(+), 185 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-21 21:51   ` Christopher Lameter
                     ` (3 more replies)
  2019-01-21 17:42 ` [PATCH 2/6] mic/scif: do not use mmap_sem Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, Davidlohr Bueso

Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem
and can simply some code around it for pinned_vm users. The counter
is 64-bit such that we need not worry about overflows such as rdma
user input controlled from userspace.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/core/umem.c             | 12 ++++++------
 drivers/infiniband/hw/hfi1/user_pages.c    |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c |  4 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  8 ++++----
 drivers/misc/mic/scif/scif_rma.c           |  6 +++---
 fs/proc/task_mmu.c                         |  2 +-
 include/linux/mm_types.h                   |  2 +-
 kernel/events/core.c                       |  8 ++++----
 kernel/fork.c                              |  2 +-
 mm/debug.c                                 |  3 ++-
 10 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 1efe0a74e06b..678abe1afcba 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -166,13 +166,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	down_write(&mm->mmap_sem);
-	if (check_add_overflow(mm->pinned_vm, npages, &new_pinned) ||
-	    (new_pinned > lock_limit && !capable(CAP_IPC_LOCK))) {
+	new_pinned = atomic64_read(&mm->pinned_vm) + npages;
+	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
 		up_write(&mm->mmap_sem);
 		ret = -ENOMEM;
 		goto out;
 	}
-	mm->pinned_vm = new_pinned;
+	atomic64_set(&mm->pinned_vm, new_pinned);
 	up_write(&mm->mmap_sem);
 
 	cur_base = addr & PAGE_MASK;
@@ -234,7 +234,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	__ib_umem_release(context->device, umem, 0);
 vma:
 	down_write(&mm->mmap_sem);
-	mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 out:
 	if (vma_list)
@@ -263,7 +263,7 @@ static void ib_umem_release_defer(struct work_struct *work)
 	struct ib_umem *umem = container_of(work, struct ib_umem, work);
 
 	down_write(&umem->owning_mm->mmap_sem);
-	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
 	up_write(&umem->owning_mm->mmap_sem);
 
 	__ib_umem_release_tail(umem);
@@ -302,7 +302,7 @@ void ib_umem_release(struct ib_umem *umem)
 	} else {
 		down_write(&umem->owning_mm->mmap_sem);
 	}
-	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
 	up_write(&umem->owning_mm->mmap_sem);
 
 	__ib_umem_release_tail(umem);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..40a6e434190f 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -92,7 +92,7 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm,
 	size = DIV_ROUND_UP(size, PAGE_SIZE);
 
 	down_read(&mm->mmap_sem);
-	pinned = mm->pinned_vm;
+	pinned = atomic64_read(&mm->pinned_vm);
 	up_read(&mm->mmap_sem);
 
 	/* First, check the absolute limit against all pinned pages. */
@@ -112,7 +112,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 		return ret;
 
 	down_write(&mm->mmap_sem);
-	mm->pinned_vm += ret;
+	atomic64_add(ret, &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 
 	return ret;
@@ -131,7 +131,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
-		mm->pinned_vm -= npages;
+		atomic64_sub(npages, &mm->pinned_vm);
 		up_write(&mm->mmap_sem);
 	}
 }
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..602387bf98e7 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -75,7 +75,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 			goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	atomic64_add(num_pages, &current->mm->pinned_vm);
 
 	ret = 0;
 	goto bail;
@@ -156,7 +156,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
 	__qib_release_user_pages(p, num_pages, 1);
 
 	if (current->mm) {
-		current->mm->pinned_vm -= num_pages;
+		atomic64_sub(num_pages, &current->mm->pinned_vm);
 		up_write(&current->mm->mmap_sem);
 	}
 }
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index ce01a59fccc4..854436a2b437 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -129,7 +129,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	uiomr->owning_mm = mm = current->mm;
 	down_write(&mm->mmap_sem);
 
-	locked = npages + current->mm->pinned_vm;
+	locked = npages + atomic64_read(&current->mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -187,7 +187,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	if (ret < 0)
 		usnic_uiom_put_pages(chunk_list, 0);
 	else {
-		mm->pinned_vm = locked;
+		atomic64_set(&mm->pinned_vm, locked);
 		mmgrab(uiomr->owning_mm);
 	}
 
@@ -441,7 +441,7 @@ static void usnic_uiom_release_defer(struct work_struct *work)
 		container_of(work, struct usnic_uiom_reg, work);
 
 	down_write(&uiomr->owning_mm->mmap_sem);
-	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
+	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
 	up_write(&uiomr->owning_mm->mmap_sem);
 
 	__usnic_uiom_release_tail(uiomr);
@@ -469,7 +469,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr,
 	} else {
 		down_write(&uiomr->owning_mm->mmap_sem);
 	}
-	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
+	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
 	up_write(&uiomr->owning_mm->mmap_sem);
 
 	__usnic_uiom_release_tail(uiomr);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 749321eb91ae..2448368f181e 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -285,7 +285,7 @@ __scif_dec_pinned_vm_lock(struct mm_struct *mm,
 	} else {
 		down_write(&mm->mmap_sem);
 	}
-	mm->pinned_vm -= nr_pages;
+	atomic64_sub(nr_pages, &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 	return 0;
 }
@@ -299,7 +299,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
 		return 0;
 
 	locked = nr_pages;
-	locked += mm->pinned_vm;
+	locked += atomic64_read(&mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
 		dev_err(scif_info.mdev.this_device,
@@ -307,7 +307,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
 			locked, lock_limit);
 		return -ENOMEM;
 	}
-	mm->pinned_vm = locked;
+	atomic64_set(&mm->pinned_vm, locked);
 	return 0;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6976e17dba68..640ae8a47c73 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
-	SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm);
+	SEQ_PUT_DEC(" kB\nVmPin:\t", atomic64_read(&mm->pinned_vm));
 	SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss);
 	SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss);
 	SEQ_PUT_DEC(" kB\nRssAnon:\t", anon);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6312b02d65ed..0c8be6f9c92d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -404,7 +404,7 @@ struct mm_struct {
 
 		unsigned long total_vm;	   /* Total pages mapped */
 		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
-		unsigned long pinned_vm;   /* Refcount permanently increased */
+		atomic64_t    pinned_vm;   /* Refcount permanently increased */
 		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
 		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
 		unsigned long stack_vm;	   /* VM_STACK */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cd13a30f732..8df0b77a4687 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5459,7 +5459,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 
 		/* now it's safe to free the pages */
 		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+		atomic64_sub(rb->aux_mmap_locked, &vma->vm_mm->pinned_vm);
 
 		/* this has to be the last one */
 		rb_free_aux(rb);
@@ -5532,7 +5532,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	 */
 
 	atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-	vma->vm_mm->pinned_vm -= mmap_locked;
+	atomic64_sub(mmap_locked, &vma->vm_mm->pinned_vm);
 	free_uid(mmap_user);
 
 out_put:
@@ -5680,7 +5680,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
-	locked = vma->vm_mm->pinned_vm + extra;
+	locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra;
 
 	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
 		!capable(CAP_IPC_LOCK)) {
@@ -5721,7 +5721,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 unlock:
 	if (!ret) {
 		atomic_long_add(user_extra, &user->locked_vm);
-		vma->vm_mm->pinned_vm += extra;
+		atomic64_add(extra, &vma->vm_mm->pinned_vm);
 
 		atomic_inc(&event->mmap_count);
 	} else if (rb) {
diff --git a/kernel/fork.c b/kernel/fork.c
index c48e9e244a89..a68de9032ced 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -981,7 +981,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-	mm->pinned_vm = 0;
+	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
 	spin_lock_init(&mm->arg_lock);
diff --git a/mm/debug.c b/mm/debug.c
index 0abb987dad9b..bcf70e365a77 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -166,7 +166,8 @@ void dump_mm(const struct mm_struct *mm)
 		mm_pgtables_bytes(mm),
 		mm->map_count,
 		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
-		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
+		atomic64_read(&mm->pinned_vm),
+		mm->data_vm, mm->exec_vm, mm->stack_vm,
 		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
 		mm->start_brk, mm->brk, mm->start_stack,
 		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/6] mic/scif: do not use mmap_sem
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 3/6] drivers/IB,qib: " Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, sudeep.dutt, ashutosh.dixit, Davidlohr Bueso

The driver uses mmap_sem for both pinned_vm accounting and
get_user_pages(). By using gup_fast() and letting the mm handle
the lock if needed, we can no longer rely on the semaphore and
simplify the whole thing.

Cc: sudeep.dutt@intel.com
Cc: ashutosh.dixit@intel.com
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/misc/mic/scif/scif_rma.c | 36 +++++++++++-------------------------
 1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 2448368f181e..263b8ad507ea 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -272,21 +272,12 @@ static inline void __scif_release_mm(struct mm_struct *mm)
 
 static inline int
 __scif_dec_pinned_vm_lock(struct mm_struct *mm,
-			  int nr_pages, bool try_lock)
+			  int nr_pages)
 {
 	if (!mm || !nr_pages || !scif_ulimit_check)
 		return 0;
-	if (try_lock) {
-		if (!down_write_trylock(&mm->mmap_sem)) {
-			dev_err(scif_info.mdev.this_device,
-				"%s %d err\n", __func__, __LINE__);
-			return -1;
-		}
-	} else {
-		down_write(&mm->mmap_sem);
-	}
+
 	atomic64_sub(nr_pages, &mm->pinned_vm);
-	up_write(&mm->mmap_sem);
 	return 0;
 }
 
@@ -298,16 +289,16 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
 	if (!mm || !nr_pages || !scif_ulimit_check)
 		return 0;
 
-	locked = nr_pages;
-	locked += atomic64_read(&mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	locked = atomic64_add_return(nr_pages, &mm->pinned_vm);
+
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+		atomic64_sub(nr_pages, &mm->pinned_vm);
 		dev_err(scif_info.mdev.this_device,
 			"locked(%lu) > lock_limit(%lu)\n",
 			locked, lock_limit);
 		return -ENOMEM;
 	}
-	atomic64_set(&mm->pinned_vm, locked);
 	return 0;
 }
 
@@ -326,7 +317,7 @@ int scif_destroy_window(struct scif_endpt *ep, struct scif_window *window)
 
 	might_sleep();
 	if (!window->temp && window->mm) {
-		__scif_dec_pinned_vm_lock(window->mm, window->nr_pages, 0);
+		__scif_dec_pinned_vm_lock(window->mm, window->nr_pages);
 		__scif_release_mm(window->mm);
 		window->mm = NULL;
 	}
@@ -737,7 +728,7 @@ int scif_unregister_window(struct scif_window *window)
 					    ep->rma_info.dma_chan);
 		} else {
 			if (!__scif_dec_pinned_vm_lock(window->mm,
-						       window->nr_pages, 1)) {
+						       window->nr_pages)) {
 				__scif_release_mm(window->mm);
 				window->mm = NULL;
 			}
@@ -1385,28 +1376,23 @@ int __scif_pin_pages(void *addr, size_t len, int *out_prot,
 		prot |= SCIF_PROT_WRITE;
 retry:
 		mm = current->mm;
-		down_write(&mm->mmap_sem);
 		if (ulimit) {
 			err = __scif_check_inc_pinned_vm(mm, nr_pages);
 			if (err) {
-				up_write(&mm->mmap_sem);
 				pinned_pages->nr_pages = 0;
 				goto error_unmap;
 			}
 		}
 
-		pinned_pages->nr_pages = get_user_pages(
+		pinned_pages->nr_pages = get_user_pages_fast(
 				(u64)addr,
 				nr_pages,
 				(prot & SCIF_PROT_WRITE) ? FOLL_WRITE : 0,
-				pinned_pages->pages,
-				NULL);
-		up_write(&mm->mmap_sem);
+				pinned_pages->pages);
 		if (nr_pages != pinned_pages->nr_pages) {
 			if (try_upgrade) {
 				if (ulimit)
-					__scif_dec_pinned_vm_lock(mm,
-								  nr_pages, 0);
+					__scif_dec_pinned_vm_lock(mm, nr_pages);
 				/* Roll back any pinned pages */
 				for (i = 0; i < pinned_pages->nr_pages; i++) {
 					if (pinned_pages->pages[i])
@@ -1433,7 +1419,7 @@ int __scif_pin_pages(void *addr, size_t len, int *out_prot,
 	return err;
 dec_pinned:
 	if (ulimit)
-		__scif_dec_pinned_vm_lock(mm, nr_pages, 0);
+		__scif_dec_pinned_vm_lock(mm, nr_pages);
 	/* Something went wrong! Rollback */
 error_unmap:
 	pinned_pages->nr_pages = nr_pages;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 2/6] mic/scif: do not use mmap_sem Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-28 23:31   ` Jason Gunthorpe
  2019-01-21 17:42 ` [PATCH 4/6] drivers/IB,hfi1: do not se mmap_sem Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

The driver uses mmap_sem for both pinned_vm accounting and
get_user_pages(). By using gup_fast() and letting the mm handle
the lock if needed, we can no longer rely on the semaphore and
simplify the whole thing as the pinning is decoupled from the lock.

This also fixes a bug that __qib_get_user_pages was not taking into
account the current value of pinned_vm.

Cc: dennis.dalessandro@intel.com
Cc: mike.marciniszyn@intel.com
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/hw/qib/qib_user_pages.c | 67 ++++++++++--------------------
 1 file changed, 22 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 602387bf98e7..e4114aad4a2f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -49,43 +49,6 @@ static void __qib_release_user_pages(struct page **p, size_t num_pages,
 	}
 }
 
-/*
- * Call with current->mm->mmap_sem held.
- */
-static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
-				struct page **p)
-{
-	unsigned long lock_limit;
-	size_t got;
-	int ret;
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
-		ret = -ENOMEM;
-		goto bail;
-	}
-
-	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(start_page + got * PAGE_SIZE,
-				     num_pages - got,
-				     FOLL_WRITE | FOLL_FORCE,
-				     p + got, NULL);
-		if (ret < 0)
-			goto bail_release;
-	}
-
-	atomic64_add(num_pages, &current->mm->pinned_vm);
-
-	ret = 0;
-	goto bail;
-
-bail_release:
-	__qib_release_user_pages(p, got, 0);
-bail:
-	return ret;
-}
-
 /**
  * qib_map_page - a safety wrapper around pci_map_page()
  *
@@ -137,26 +100,40 @@ int qib_map_page(struct pci_dev *hwdev, struct page *page, dma_addr_t *daddr)
 int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 		       struct page **p)
 {
+	unsigned long locked, lock_limit;
+	size_t got;
 	int ret;
 
-	down_write(&current->mm->mmap_sem);
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	locked = atomic64_add_return(num_pages, &current->mm->pinned_vm);
 
-	ret = __qib_get_user_pages(start_page, num_pages, p);
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		ret = -ENOMEM;
+		goto bail;
+	}
 
-	up_write(&current->mm->mmap_sem);
+	for (got = 0; got < num_pages; got += ret) {
+		ret = get_user_pages_fast(start_page + got * PAGE_SIZE,
+				     num_pages - got,
+				     FOLL_WRITE | FOLL_FORCE,
+				     p + got);
+		if (ret < 0)
+			goto bail_release;
+	}
 
+	return 0;
+bail_release:
+	__qib_release_user_pages(p, got, 0);
+bail:
+	atomic64_sub(num_pages, &current->mm->pinned_vm);
 	return ret;
 }
 
 void qib_release_user_pages(struct page **p, size_t num_pages)
 {
-	if (current->mm) /* during close after signal, mm can be NULL */
-		down_write(&current->mm->mmap_sem);
-
 	__qib_release_user_pages(p, num_pages, 1);
 
-	if (current->mm) {
+	if (current->mm) { /* during close after signal, mm can be NULL */
 		atomic64_sub(num_pages, &current->mm->pinned_vm);
-		up_write(&current->mm->mmap_sem);
 	}
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/6] drivers/IB,hfi1: do not se mmap_sem
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2019-01-21 17:42 ` [PATCH 3/6] drivers/IB,qib: " Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 5/6] drivers/IB,usnic: reduce scope of mmap_sem Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 6/6] drivers/IB,core: " Davidlohr Bueso
  5 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, mike.marciniszyn, dennis.dalessandro,
	Davidlohr Bueso

This driver already uses gup_fast() and thus we can just drop
the mmap_sem protection around the pinned_vm counter. Note that
the window between when hfi1_can_pin_pages() is called and the
actual counter is incremented remains the same as mmap_sem was
_only_ used for when ->pinned_vm was touched.

Cc: mike.marciniszyn@intel.com
Cc: dennis.dalessandro@intel.com
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/hw/hfi1/user_pages.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 40a6e434190f..24b592c6522e 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -91,9 +91,7 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm,
 	/* Convert to number of pages */
 	size = DIV_ROUND_UP(size, PAGE_SIZE);
 
-	down_read(&mm->mmap_sem);
 	pinned = atomic64_read(&mm->pinned_vm);
-	up_read(&mm->mmap_sem);
 
 	/* First, check the absolute limit against all pinned pages. */
 	if (pinned + npages >= ulimit && !can_lock)
@@ -111,9 +109,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 	if (ret < 0)
 		return ret;
 
-	down_write(&mm->mmap_sem);
 	atomic64_add(ret, &mm->pinned_vm);
-	up_write(&mm->mmap_sem);
 
 	return ret;
 }
@@ -130,8 +126,6 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 	}
 
 	if (mm) { /* during close after signal, mm can be NULL */
-		down_write(&mm->mmap_sem);
 		atomic64_sub(npages, &mm->pinned_vm);
-		up_write(&mm->mmap_sem);
 	}
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/6] drivers/IB,usnic: reduce scope of mmap_sem
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2019-01-21 17:42 ` [PATCH 4/6] drivers/IB,hfi1: do not se mmap_sem Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-21 17:42 ` [PATCH 6/6] drivers/IB,core: " Davidlohr Bueso
  5 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, benve, neescoba, pkaustub, Davidlohr Bueso

usnic_uiom_get_pages() uses gup_longterm() so we cannot really
get rid of mmap_sem altogether in the driver, but we can get
rid of some complexity that mmap_sem brings with only pinned_vm.
We can get rid of the wq altogether as we no longer need to
defer work to unpin pages as the counter is now atomic.

Cc: benve@cisco.com
Cc: neescoba@cisco.com
Cc: pkaustub@cisco.com
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/hw/usnic/usnic_ib_main.c |  2 --
 drivers/infiniband/hw/usnic/usnic_uiom.c    | 54 +++--------------------------
 drivers/infiniband/hw/usnic/usnic_uiom.h    |  1 -
 3 files changed, 4 insertions(+), 53 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c b/drivers/infiniband/hw/usnic/usnic_ib_main.c
index 3201dd1899c7..1d363b706314 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
@@ -691,7 +691,6 @@ static int __init usnic_ib_init(void)
 out_pci_unreg:
 	pci_unregister_driver(&usnic_ib_pci_driver);
 out_umem_fini:
-	usnic_uiom_fini();
 
 	return err;
 }
@@ -704,7 +703,6 @@ static void __exit usnic_ib_destroy(void)
 	unregister_inetaddr_notifier(&usnic_ib_inetaddr_notifier);
 	unregister_netdevice_notifier(&usnic_ib_netdevice_notifier);
 	pci_unregister_driver(&usnic_ib_pci_driver);
-	usnic_uiom_fini();
 }
 
 MODULE_DESCRIPTION("Cisco VIC (usNIC) Verbs Driver");
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 854436a2b437..505252298b52 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -47,8 +47,6 @@
 #include "usnic_uiom.h"
 #include "usnic_uiom_interval_tree.h"
 
-static struct workqueue_struct *usnic_uiom_wq;
-
 #define USNIC_UIOM_PAGE_CHUNK						\
 	((PAGE_SIZE - offsetof(struct usnic_uiom_chunk, page_list))	/\
 	((void *) &((struct usnic_uiom_chunk *) 0)->page_list[1] -	\
@@ -129,7 +127,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	uiomr->owning_mm = mm = current->mm;
 	down_write(&mm->mmap_sem);
 
-	locked = npages + atomic64_read(&current->mm->pinned_vm);
+	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -184,12 +182,11 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	}
 
 out:
-	if (ret < 0)
+	if (ret < 0) {
 		usnic_uiom_put_pages(chunk_list, 0);
-	else {
-		atomic64_set(&mm->pinned_vm, locked);
+		atomic64_sub(npages, &current->mm->pinned_vm);
+	} else
 		mmgrab(uiomr->owning_mm);
-	}
 
 	up_write(&mm->mmap_sem);
 	free_page((unsigned long) page_list);
@@ -435,43 +432,12 @@ static inline size_t usnic_uiom_num_pages(struct usnic_uiom_reg *uiomr)
 	return PAGE_ALIGN(uiomr->length + uiomr->offset) >> PAGE_SHIFT;
 }
 
-static void usnic_uiom_release_defer(struct work_struct *work)
-{
-	struct usnic_uiom_reg *uiomr =
-		container_of(work, struct usnic_uiom_reg, work);
-
-	down_write(&uiomr->owning_mm->mmap_sem);
-	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
-	up_write(&uiomr->owning_mm->mmap_sem);
-
-	__usnic_uiom_release_tail(uiomr);
-}
-
 void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr,
 			    struct ib_ucontext *context)
 {
 	__usnic_uiom_reg_release(uiomr->pd, uiomr, 1);
 
-	/*
-	 * We may be called with the mm's mmap_sem already held.  This
-	 * can happen when a userspace munmap() is the call that drops
-	 * the last reference to our file and calls our release
-	 * method.  If there are memory regions to destroy, we'll end
-	 * up here and not be able to take the mmap_sem.  In that case
-	 * we defer the vm_locked accounting to a workqueue.
-	 */
-	if (context->closing) {
-		if (!down_write_trylock(&uiomr->owning_mm->mmap_sem)) {
-			INIT_WORK(&uiomr->work, usnic_uiom_release_defer);
-			queue_work(usnic_uiom_wq, &uiomr->work);
-			return;
-		}
-	} else {
-		down_write(&uiomr->owning_mm->mmap_sem);
-	}
 	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
-	up_write(&uiomr->owning_mm->mmap_sem);
-
 	__usnic_uiom_release_tail(uiomr);
 }
 
@@ -600,17 +566,5 @@ int usnic_uiom_init(char *drv_name)
 		return -EPERM;
 	}
 
-	usnic_uiom_wq = create_workqueue(drv_name);
-	if (!usnic_uiom_wq) {
-		usnic_err("Unable to alloc wq for drv %s\n", drv_name);
-		return -ENOMEM;
-	}
-
 	return 0;
 }
-
-void usnic_uiom_fini(void)
-{
-	flush_workqueue(usnic_uiom_wq);
-	destroy_workqueue(usnic_uiom_wq);
-}
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.h b/drivers/infiniband/hw/usnic/usnic_uiom.h
index b86a9731071b..c88cfa087e3a 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.h
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.h
@@ -93,5 +93,4 @@ struct usnic_uiom_reg *usnic_uiom_reg_get(struct usnic_uiom_pd *pd,
 void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr,
 			    struct ib_ucontext *ucontext);
 int usnic_uiom_init(char *drv_name);
-void usnic_uiom_fini(void);
 #endif /* USNIC_UIOM_H_ */
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/6] drivers/IB,core: reduce scope of mmap_sem
  2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2019-01-21 17:42 ` [PATCH 5/6] drivers/IB,usnic: reduce scope of mmap_sem Davidlohr Bueso
@ 2019-01-21 17:42 ` Davidlohr Bueso
  2019-01-21 18:32   ` Jason Gunthorpe
  2019-01-21 21:53   ` Christopher Lameter
  5 siblings, 2 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 17:42 UTC (permalink / raw)
  To: akpm
  Cc: dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, Davidlohr Bueso

ib_umem_get() uses gup_longterm() and relies on the lock to
stabilze the vma_list, so we cannot really get rid of mmap_sem
altogether, but now that the counter is atomic, we can get of
some complexity that mmap_sem brings with only pinned_vm.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/core/umem.c | 41 ++---------------------------------------
 1 file changed, 2 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 678abe1afcba..b69d3efa8712 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -165,15 +165,12 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-	down_write(&mm->mmap_sem);
-	new_pinned = atomic64_read(&mm->pinned_vm) + npages;
+	new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
 	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
-		up_write(&mm->mmap_sem);
+		atomic64_sub(npages, &mm->pinned_vm);
 		ret = -ENOMEM;
 		goto out;
 	}
-	atomic64_set(&mm->pinned_vm, new_pinned);
-	up_write(&mm->mmap_sem);
 
 	cur_base = addr & PAGE_MASK;
 
@@ -233,9 +230,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 umem_release:
 	__ib_umem_release(context->device, umem, 0);
 vma:
-	down_write(&mm->mmap_sem);
 	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
-	up_write(&mm->mmap_sem);
 out:
 	if (vma_list)
 		free_page((unsigned long) vma_list);
@@ -258,25 +253,12 @@ static void __ib_umem_release_tail(struct ib_umem *umem)
 		kfree(umem);
 }
 
-static void ib_umem_release_defer(struct work_struct *work)
-{
-	struct ib_umem *umem = container_of(work, struct ib_umem, work);
-
-	down_write(&umem->owning_mm->mmap_sem);
-	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
-	up_write(&umem->owning_mm->mmap_sem);
-
-	__ib_umem_release_tail(umem);
-}
-
 /**
  * ib_umem_release - release memory pinned with ib_umem_get
  * @umem: umem struct to release
  */
 void ib_umem_release(struct ib_umem *umem)
 {
-	struct ib_ucontext *context = umem->context;
-
 	if (umem->is_odp) {
 		ib_umem_odp_release(to_ib_umem_odp(umem));
 		__ib_umem_release_tail(umem);
@@ -285,26 +267,7 @@ void ib_umem_release(struct ib_umem *umem)
 
 	__ib_umem_release(umem->context->device, umem, 1);
 
-	/*
-	 * We may be called with the mm's mmap_sem already held.  This
-	 * can happen when a userspace munmap() is the call that drops
-	 * the last reference to our file and calls our release
-	 * method.  If there are memory regions to destroy, we'll end
-	 * up here and not be able to take the mmap_sem.  In that case
-	 * we defer the vm_locked accounting a workqueue.
-	 */
-	if (context->closing) {
-		if (!down_write_trylock(&umem->owning_mm->mmap_sem)) {
-			INIT_WORK(&umem->work, ib_umem_release_defer);
-			queue_work(ib_wq, &umem->work);
-			return;
-		}
-	} else {
-		down_write(&umem->owning_mm->mmap_sem);
-	}
 	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
-	up_write(&umem->owning_mm->mmap_sem);
-
 	__ib_umem_release_tail(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] drivers/IB,core: reduce scope of mmap_sem
  2019-01-21 17:42 ` [PATCH 6/6] drivers/IB,core: " Davidlohr Bueso
@ 2019-01-21 18:32   ` Jason Gunthorpe
  2019-01-21 19:12     ` Davidlohr Bueso
  2019-01-21 21:53   ` Christopher Lameter
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2019-01-21 18:32 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso

On Mon, Jan 21, 2019 at 09:42:20AM -0800, Davidlohr Bueso wrote:
> ib_umem_get() uses gup_longterm() and relies on the lock to
> stabilze the vma_list, so we cannot really get rid of mmap_sem
> altogether, but now that the counter is atomic, we can get of
> some complexity that mmap_sem brings with only pinned_vm.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  drivers/infiniband/core/umem.c | 41 ++---------------------------------------
>  1 file changed, 2 insertions(+), 39 deletions(-)

I think this addresses my comment..

Considering that it is almost all infiniband, I'd rather it go it go
through the RDMA tree with an ack from mm people? Please advise..

Thanks,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] drivers/IB,core: reduce scope of mmap_sem
  2019-01-21 18:32   ` Jason Gunthorpe
@ 2019-01-21 19:12     ` Davidlohr Bueso
  0 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-21 19:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso, willy

On Mon, 21 Jan 2019, Jason Gunthorpe wrote:

>On Mon, Jan 21, 2019 at 09:42:20AM -0800, Davidlohr Bueso wrote:
>> ib_umem_get() uses gup_longterm() and relies on the lock to
>> stabilze the vma_list, so we cannot really get rid of mmap_sem
>> altogether, but now that the counter is atomic, we can get of
>> some complexity that mmap_sem brings with only pinned_vm.
>>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> ---
>>  drivers/infiniband/core/umem.c | 41 ++---------------------------------------
>>  1 file changed, 2 insertions(+), 39 deletions(-)
>
>I think this addresses my comment..
>
>Considering that it is almost all infiniband, I'd rather it go it go
>through the RDMA tree with an ack from mm people? Please advise..

Yeah also Cc'ing Willy who I forgot to add for v2.

>
>Thanks,
>Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
@ 2019-01-21 21:51   ` Christopher Lameter
  2019-01-22  9:56   ` Jan Kara
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Christopher Lameter @ 2019-01-21 21:51 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso

On Mon, 21 Jan 2019, Davidlohr Bueso wrote
> Taking a sleeping lock to _only_ increment a variable is quite the
> overkill, and pretty much all users do this. Furthermore, some drivers
> (ie: infiniband and scif) that need pinned semantics can go to quite
> some trouble to actually delay via workqueue (un)accounting for pinned
> pages when not possible to acquire it.

Reviewed-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] drivers/IB,core: reduce scope of mmap_sem
  2019-01-21 17:42 ` [PATCH 6/6] drivers/IB,core: " Davidlohr Bueso
  2019-01-21 18:32   ` Jason Gunthorpe
@ 2019-01-21 21:53   ` Christopher Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Christopher Lameter @ 2019-01-21 21:53 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso

On Mon, 21 Jan 2019, Davidlohr Bueso wrote:

> ib_umem_get() uses gup_longterm() and relies on the lock to
> stabilze the vma_list, so we cannot really get rid of mmap_sem
> altogether, but now that the counter is atomic, we can get of
> some complexity that mmap_sem brings with only pinned_vm.

Reviewd-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
  2019-01-21 21:51   ` Christopher Lameter
@ 2019-01-22  9:56   ` Jan Kara
  2019-01-22 15:45   ` Daniel Jordan
  2019-01-23 18:33   ` Jason Gunthorpe
  3 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2019-01-22  9:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso

On Mon 21-01-19 09:42:15, Davidlohr Bueso wrote:
> Taking a sleeping lock to _only_ increment a variable is quite the
> overkill, and pretty much all users do this. Furthermore, some drivers
> (ie: infiniband and scif) that need pinned semantics can go to quite
> some trouble to actually delay via workqueue (un)accounting for pinned
> pages when not possible to acquire it.
> 
> By making the counter atomic we no longer need to hold the mmap_sem
> and can simply some code around it for pinned_vm users. The counter
> is 64-bit such that we need not worry about overflows such as rdma
> user input controlled from userspace.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

and I really like the cleanups allowed by this in the drivers :)

								Honza

> ---
>  drivers/infiniband/core/umem.c             | 12 ++++++------
>  drivers/infiniband/hw/hfi1/user_pages.c    |  6 +++---
>  drivers/infiniband/hw/qib/qib_user_pages.c |  4 ++--
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  8 ++++----
>  drivers/misc/mic/scif/scif_rma.c           |  6 +++---
>  fs/proc/task_mmu.c                         |  2 +-
>  include/linux/mm_types.h                   |  2 +-
>  kernel/events/core.c                       |  8 ++++----
>  kernel/fork.c                              |  2 +-
>  mm/debug.c                                 |  3 ++-
>  10 files changed, 27 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 1efe0a74e06b..678abe1afcba 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -166,13 +166,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  
>  	down_write(&mm->mmap_sem);
> -	if (check_add_overflow(mm->pinned_vm, npages, &new_pinned) ||
> -	    (new_pinned > lock_limit && !capable(CAP_IPC_LOCK))) {
> +	new_pinned = atomic64_read(&mm->pinned_vm) + npages;
> +	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
>  		up_write(&mm->mmap_sem);
>  		ret = -ENOMEM;
>  		goto out;
>  	}
> -	mm->pinned_vm = new_pinned;
> +	atomic64_set(&mm->pinned_vm, new_pinned);
>  	up_write(&mm->mmap_sem);
>  
>  	cur_base = addr & PAGE_MASK;
> @@ -234,7 +234,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
>  	__ib_umem_release(context->device, umem, 0);
>  vma:
>  	down_write(&mm->mmap_sem);
> -	mm->pinned_vm -= ib_umem_num_pages(umem);
> +	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
>  	up_write(&mm->mmap_sem);
>  out:
>  	if (vma_list)
> @@ -263,7 +263,7 @@ static void ib_umem_release_defer(struct work_struct *work)
>  	struct ib_umem *umem = container_of(work, struct ib_umem, work);
>  
>  	down_write(&umem->owning_mm->mmap_sem);
> -	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
> +	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
>  	up_write(&umem->owning_mm->mmap_sem);
>  
>  	__ib_umem_release_tail(umem);
> @@ -302,7 +302,7 @@ void ib_umem_release(struct ib_umem *umem)
>  	} else {
>  		down_write(&umem->owning_mm->mmap_sem);
>  	}
> -	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
> +	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
>  	up_write(&umem->owning_mm->mmap_sem);
>  
>  	__ib_umem_release_tail(umem);
> diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
> index e341e6dcc388..40a6e434190f 100644
> --- a/drivers/infiniband/hw/hfi1/user_pages.c
> +++ b/drivers/infiniband/hw/hfi1/user_pages.c
> @@ -92,7 +92,7 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm,
>  	size = DIV_ROUND_UP(size, PAGE_SIZE);
>  
>  	down_read(&mm->mmap_sem);
> -	pinned = mm->pinned_vm;
> +	pinned = atomic64_read(&mm->pinned_vm);
>  	up_read(&mm->mmap_sem);
>  
>  	/* First, check the absolute limit against all pinned pages. */
> @@ -112,7 +112,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
>  		return ret;
>  
>  	down_write(&mm->mmap_sem);
> -	mm->pinned_vm += ret;
> +	atomic64_add(ret, &mm->pinned_vm);
>  	up_write(&mm->mmap_sem);
>  
>  	return ret;
> @@ -131,7 +131,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
>  
>  	if (mm) { /* during close after signal, mm can be NULL */
>  		down_write(&mm->mmap_sem);
> -		mm->pinned_vm -= npages;
> +		atomic64_sub(npages, &mm->pinned_vm);
>  		up_write(&mm->mmap_sem);
>  	}
>  }
> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> index 16543d5e80c3..602387bf98e7 100644
> --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> @@ -75,7 +75,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  			goto bail_release;
>  	}
>  
> -	current->mm->pinned_vm += num_pages;
> +	atomic64_add(num_pages, &current->mm->pinned_vm);
>  
>  	ret = 0;
>  	goto bail;
> @@ -156,7 +156,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
>  	__qib_release_user_pages(p, num_pages, 1);
>  
>  	if (current->mm) {
> -		current->mm->pinned_vm -= num_pages;
> +		atomic64_sub(num_pages, &current->mm->pinned_vm);
>  		up_write(&current->mm->mmap_sem);
>  	}
>  }
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index ce01a59fccc4..854436a2b437 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -129,7 +129,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  	uiomr->owning_mm = mm = current->mm;
>  	down_write(&mm->mmap_sem);
>  
> -	locked = npages + current->mm->pinned_vm;
> +	locked = npages + atomic64_read(&current->mm->pinned_vm);
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  
>  	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> @@ -187,7 +187,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  	if (ret < 0)
>  		usnic_uiom_put_pages(chunk_list, 0);
>  	else {
> -		mm->pinned_vm = locked;
> +		atomic64_set(&mm->pinned_vm, locked);
>  		mmgrab(uiomr->owning_mm);
>  	}
>  
> @@ -441,7 +441,7 @@ static void usnic_uiom_release_defer(struct work_struct *work)
>  		container_of(work, struct usnic_uiom_reg, work);
>  
>  	down_write(&uiomr->owning_mm->mmap_sem);
> -	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
> +	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
>  	up_write(&uiomr->owning_mm->mmap_sem);
>  
>  	__usnic_uiom_release_tail(uiomr);
> @@ -469,7 +469,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr,
>  	} else {
>  		down_write(&uiomr->owning_mm->mmap_sem);
>  	}
> -	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
> +	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
>  	up_write(&uiomr->owning_mm->mmap_sem);
>  
>  	__usnic_uiom_release_tail(uiomr);
> diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
> index 749321eb91ae..2448368f181e 100644
> --- a/drivers/misc/mic/scif/scif_rma.c
> +++ b/drivers/misc/mic/scif/scif_rma.c
> @@ -285,7 +285,7 @@ __scif_dec_pinned_vm_lock(struct mm_struct *mm,
>  	} else {
>  		down_write(&mm->mmap_sem);
>  	}
> -	mm->pinned_vm -= nr_pages;
> +	atomic64_sub(nr_pages, &mm->pinned_vm);
>  	up_write(&mm->mmap_sem);
>  	return 0;
>  }
> @@ -299,7 +299,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
>  		return 0;
>  
>  	locked = nr_pages;
> -	locked += mm->pinned_vm;
> +	locked += atomic64_read(&mm->pinned_vm);
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>  		dev_err(scif_info.mdev.this_device,
> @@ -307,7 +307,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
>  			locked, lock_limit);
>  		return -ENOMEM;
>  	}
> -	mm->pinned_vm = locked;
> +	atomic64_set(&mm->pinned_vm, locked);
>  	return 0;
>  }
>  
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6976e17dba68..640ae8a47c73 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>  	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>  	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> -	SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm);
> +	SEQ_PUT_DEC(" kB\nVmPin:\t", atomic64_read(&mm->pinned_vm));
>  	SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss);
>  	SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss);
>  	SEQ_PUT_DEC(" kB\nRssAnon:\t", anon);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6312b02d65ed..0c8be6f9c92d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -404,7 +404,7 @@ struct mm_struct {
>  
>  		unsigned long total_vm;	   /* Total pages mapped */
>  		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
> -		unsigned long pinned_vm;   /* Refcount permanently increased */
> +		atomic64_t    pinned_vm;   /* Refcount permanently increased */
>  		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
>  		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
>  		unsigned long stack_vm;	   /* VM_STACK */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3cd13a30f732..8df0b77a4687 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5459,7 +5459,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  
>  		/* now it's safe to free the pages */
>  		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
> -		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
> +		atomic64_sub(rb->aux_mmap_locked, &vma->vm_mm->pinned_vm);
>  
>  		/* this has to be the last one */
>  		rb_free_aux(rb);
> @@ -5532,7 +5532,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  	 */
>  
>  	atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
> -	vma->vm_mm->pinned_vm -= mmap_locked;
> +	atomic64_sub(mmap_locked, &vma->vm_mm->pinned_vm);
>  	free_uid(mmap_user);
>  
>  out_put:
> @@ -5680,7 +5680,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  
>  	lock_limit = rlimit(RLIMIT_MEMLOCK);
>  	lock_limit >>= PAGE_SHIFT;
> -	locked = vma->vm_mm->pinned_vm + extra;
> +	locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra;
>  
>  	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
>  		!capable(CAP_IPC_LOCK)) {
> @@ -5721,7 +5721,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  unlock:
>  	if (!ret) {
>  		atomic_long_add(user_extra, &user->locked_vm);
> -		vma->vm_mm->pinned_vm += extra;
> +		atomic64_add(extra, &vma->vm_mm->pinned_vm);
>  
>  		atomic_inc(&event->mmap_count);
>  	} else if (rb) {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c48e9e244a89..a68de9032ced 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -981,7 +981,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
>  	mm->locked_vm = 0;
> -	mm->pinned_vm = 0;
> +	atomic64_set(&mm->pinned_vm, 0);
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>  	spin_lock_init(&mm->page_table_lock);
>  	spin_lock_init(&mm->arg_lock);
> diff --git a/mm/debug.c b/mm/debug.c
> index 0abb987dad9b..bcf70e365a77 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -166,7 +166,8 @@ void dump_mm(const struct mm_struct *mm)
>  		mm_pgtables_bytes(mm),
>  		mm->map_count,
>  		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
> -		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
> +		atomic64_read(&mm->pinned_vm),
> +		mm->data_vm, mm->exec_vm, mm->stack_vm,
>  		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
>  		mm->start_brk, mm->brk, mm->start_stack,
>  		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
> -- 
> 2.16.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
  2019-01-21 21:51   ` Christopher Lameter
  2019-01-22  9:56   ` Jan Kara
@ 2019-01-22 15:45   ` Daniel Jordan
  2019-01-23 18:33   ` Jason Gunthorpe
  3 siblings, 0 replies; 23+ messages in thread
From: Daniel Jordan @ 2019-01-22 15:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jgg, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso

On Mon, Jan 21, 2019 at 09:42:15AM -0800, Davidlohr Bueso wrote:
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6976e17dba68..640ae8a47c73 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>  	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>  	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> -	SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm);
> +	SEQ_PUT_DEC(" kB\nVmPin:\t", atomic64_read(&mm->pinned_vm));
>  	SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss);
>  	SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss);
>  	SEQ_PUT_DEC(" kB\nRssAnon:\t", anon);

This is signed on 64b but printed as unsigned, so if some bug made pinned_vm go
negative, it would appear as an obviously wrong, gigantic number.  Seems ok.

For this patch, you can add

Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2019-01-22 15:45   ` Daniel Jordan
@ 2019-01-23 18:33   ` Jason Gunthorpe
  2019-01-28 21:10     ` Andrew Morton
  3 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2019-01-23 18:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso, Christoph Lameter, Jan Kara,
	Daniel Jordan

On Mon, Jan 21, 2019 at 09:42:15AM -0800, Davidlohr Bueso wrote:
> Taking a sleeping lock to _only_ increment a variable is quite the
> overkill, and pretty much all users do this. Furthermore, some drivers
> (ie: infiniband and scif) that need pinned semantics can go to quite
> some trouble to actually delay via workqueue (un)accounting for pinned
> pages when not possible to acquire it.
> 
> By making the counter atomic we no longer need to hold the mmap_sem
> and can simply some code around it for pinned_vm users. The counter
> is 64-bit such that we need not worry about overflows such as rdma
> user input controlled from userspace.

I see a number of MM people Reviewed-by this so are we good to take
this in the RDMA tree now?

Regards,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-01-23 18:33   ` Jason Gunthorpe
@ 2019-01-28 21:10     ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2019-01-28 21:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Davidlohr Bueso, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, Davidlohr Bueso, Christoph Lameter, Jan Kara,
	Daniel Jordan

On Wed, 23 Jan 2019 11:33:53 -0700 Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Jan 21, 2019 at 09:42:15AM -0800, Davidlohr Bueso wrote:
> > Taking a sleeping lock to _only_ increment a variable is quite the
> > overkill, and pretty much all users do this. Furthermore, some drivers
> > (ie: infiniband and scif) that need pinned semantics can go to quite
> > some trouble to actually delay via workqueue (un)accounting for pinned
> > pages when not possible to acquire it.
> > 
> > By making the counter atomic we no longer need to hold the mmap_sem
> > and can simply some code around it for pinned_vm users. The counter
> > is 64-bit such that we need not worry about overflows such as rdma
> > user input controlled from userspace.
> 
> I see a number of MM people Reviewed-by this so are we good to take
> this in the RDMA tree now?

Please do.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-21 17:42 ` [PATCH 3/6] drivers/IB,qib: " Davidlohr Bueso
@ 2019-01-28 23:31   ` Jason Gunthorpe
  2019-01-29  4:46     ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2019-01-28 23:31 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

On Mon, Jan 21, 2019 at 09:42:17AM -0800, Davidlohr Bueso wrote:
> The driver uses mmap_sem for both pinned_vm accounting and
> get_user_pages(). By using gup_fast() and letting the mm handle
> the lock if needed, we can no longer rely on the semaphore and
> simplify the whole thing as the pinning is decoupled from the lock.
> 
> This also fixes a bug that __qib_get_user_pages was not taking into
> account the current value of pinned_vm.
> 
> Cc: dennis.dalessandro@intel.com
> Cc: mike.marciniszyn@intel.com
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  drivers/infiniband/hw/qib/qib_user_pages.c | 67 ++++++++++--------------------
>  1 file changed, 22 insertions(+), 45 deletions(-)

I need you to respin this patch/series against the latest rdma tree:

git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

branch for-next

> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> -static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
> -				struct page **p)
> -{
> -	unsigned long lock_limit;
> -	size_t got;
> -	int ret;
> -
> -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> -	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
> -		ret = -ENOMEM;
> -		goto bail;
> -	}
> -
> -	for (got = 0; got < num_pages; got += ret) {
> -		ret = get_user_pages(start_page + got * PAGE_SIZE,
> -				     num_pages - got,
> -				     FOLL_WRITE | FOLL_FORCE,
> -				     p + got, NULL);

As this has been rightly changed to get_user_pages_longterm, and I
think the right answer to solve the conflict is to discard some of
this patch?

Since Andrew is OK with this I can move this ahead once this is
resolved, thanks.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-28 23:31   ` Jason Gunthorpe
@ 2019-01-29  4:46     ` Jason Gunthorpe
  2019-01-29 14:14       ` Davidlohr Bueso
  2019-01-29 18:50       ` Ira Weiny
  0 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2019-01-29  4:46 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

On Mon, Jan 28, 2019 at 04:31:40PM -0700, Jason Gunthorpe wrote:
> On Mon, Jan 21, 2019 at 09:42:17AM -0800, Davidlohr Bueso wrote:
> > The driver uses mmap_sem for both pinned_vm accounting and
> > get_user_pages(). By using gup_fast() and letting the mm handle
> > the lock if needed, we can no longer rely on the semaphore and
> > simplify the whole thing as the pinning is decoupled from the lock.
> > 
> > This also fixes a bug that __qib_get_user_pages was not taking into
> > account the current value of pinned_vm.
> > 
> > Cc: dennis.dalessandro@intel.com
> > Cc: mike.marciniszyn@intel.com
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> >  drivers/infiniband/hw/qib/qib_user_pages.c | 67 ++++++++++--------------------
> >  1 file changed, 22 insertions(+), 45 deletions(-)
> 
> I need you to respin this patch/series against the latest rdma tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> 
> branch for-next
> 
> > diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> > -static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
> > -				struct page **p)
> > -{
> > -	unsigned long lock_limit;
> > -	size_t got;
> > -	int ret;
> > -
> > -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -
> > -	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
> > -		ret = -ENOMEM;
> > -		goto bail;
> > -	}
> > -
> > -	for (got = 0; got < num_pages; got += ret) {
> > -		ret = get_user_pages(start_page + got * PAGE_SIZE,
> > -				     num_pages - got,
> > -				     FOLL_WRITE | FOLL_FORCE,
> > -				     p + got, NULL);
> 
> As this has been rightly changed to get_user_pages_longterm, and I
> think the right answer to solve the conflict is to discard some of
> this patch?

.. and I'm looking at some of the other conversions here.. *most
likely* any caller that is manipulating rlimit for get_user_pages
should really be calling get_user_pages_longterm, so they should not
be converted to use _fast?

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-29  4:46     ` Jason Gunthorpe
@ 2019-01-29 14:14       ` Davidlohr Bueso
  2019-01-29 18:50       ` Ira Weiny
  1 sibling, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-01-29 14:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, dledford, jack, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

On Mon, 28 Jan 2019, Jason Gunthorpe wrote:

>.. and I'm looking at some of the other conversions here.. *most
>likely* any caller that is manipulating rlimit for get_user_pages
>should really be calling get_user_pages_longterm, so they should not
>be converted to use _fast?

Yeah this was something that crossed my mind could come up when I
first sent the series. I'll send a v3 where I leave the gup() calls
alone for rdma drivers; and we can at least convert the mmap_sem to
reader.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-29  4:46     ` Jason Gunthorpe
  2019-01-29 14:14       ` Davidlohr Bueso
@ 2019-01-29 18:50       ` Ira Weiny
  2019-01-29 23:19         ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Ira Weiny @ 2019-01-29 18:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Davidlohr Bueso, akpm, dledford, jack, linux-rdma, linux-mm,
	linux-kernel, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

On Mon, Jan 28, 2019 at 09:46:07PM -0700, Jason Gunthorpe wrote:
> On Mon, Jan 28, 2019 at 04:31:40PM -0700, Jason Gunthorpe wrote:
> > On Mon, Jan 21, 2019 at 09:42:17AM -0800, Davidlohr Bueso wrote:
> > > The driver uses mmap_sem for both pinned_vm accounting and
> > > get_user_pages(). By using gup_fast() and letting the mm handle
> > > the lock if needed, we can no longer rely on the semaphore and
> > > simplify the whole thing as the pinning is decoupled from the lock.
> > > 
> > > This also fixes a bug that __qib_get_user_pages was not taking into
> > > account the current value of pinned_vm.
> > > 
> > > Cc: dennis.dalessandro@intel.com
> > > Cc: mike.marciniszyn@intel.com
> > > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> > >  drivers/infiniband/hw/qib/qib_user_pages.c | 67 ++++++++++--------------------
> > >  1 file changed, 22 insertions(+), 45 deletions(-)
> > 
> > I need you to respin this patch/series against the latest rdma tree:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> > 
> > branch for-next
> > 
> > > diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> > > -static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
> > > -				struct page **p)
> > > -{
> > > -	unsigned long lock_limit;
> > > -	size_t got;
> > > -	int ret;
> > > -
> > > -	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > -
> > > -	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
> > > -		ret = -ENOMEM;
> > > -		goto bail;
> > > -	}
> > > -
> > > -	for (got = 0; got < num_pages; got += ret) {
> > > -		ret = get_user_pages(start_page + got * PAGE_SIZE,
> > > -				     num_pages - got,
> > > -				     FOLL_WRITE | FOLL_FORCE,
> > > -				     p + got, NULL);
> > 
> > As this has been rightly changed to get_user_pages_longterm, and I
> > think the right answer to solve the conflict is to discard some of
> > this patch?
> 
> .. and I'm looking at some of the other conversions here.. *most
> likely* any caller that is manipulating rlimit for get_user_pages
> should really be calling get_user_pages_longterm, so they should not
> be converted to use _fast?

Is this a question?  I'm not sure I understand the meaning here?

Ira

> 
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-29 18:50       ` Ira Weiny
@ 2019-01-29 23:19         ` Jason Gunthorpe
  2019-01-30 18:01           ` Weiny, Ira
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2019-01-29 23:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Davidlohr Bueso, akpm, dledford, jack, linux-rdma, linux-mm,
	linux-kernel, dennis.dalessandro, mike.marciniszyn,
	Davidlohr Bueso

On Tue, Jan 29, 2019 at 10:50:05AM -0800, Ira Weiny wrote:
> > .. and I'm looking at some of the other conversions here.. *most
> > likely* any caller that is manipulating rlimit for get_user_pages
> > should really be calling get_user_pages_longterm, so they should not
> > be converted to use _fast?
> 
> Is this a question?  I'm not sure I understand the meaning here?

More an invitation to disprove the statement

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-29 23:19         ` Jason Gunthorpe
@ 2019-01-30 18:01           ` Weiny, Ira
  2019-01-31 10:04             ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Weiny, Ira @ 2019-01-30 18:01 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: Davidlohr Bueso, akpm, dledford, jack, linux-rdma, linux-mm,
	linux-kernel, Dalessandro, Dennis, Marciniszyn, Mike,
	Davidlohr Bueso

> 
> On Tue, Jan 29, 2019 at 10:50:05AM -0800, Ira Weiny wrote:
> > > .. and I'm looking at some of the other conversions here.. *most
> > > likely* any caller that is manipulating rlimit for get_user_pages
> > > should really be calling get_user_pages_longterm, so they should not
> > > be converted to use _fast?
> >
> > Is this a question?  I'm not sure I understand the meaning here?
> 
> More an invitation to disprove the statement

Generally I agree.  But would be best if we could get fast GUP for performance.  I have not worked out if that will be possible with the final "longterm" solutions.

IRa

> 
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] drivers/IB,qib: do not use mmap_sem
  2019-01-30 18:01           ` Weiny, Ira
@ 2019-01-31 10:04             ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2019-01-31 10:04 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: 'Jason Gunthorpe',
	Davidlohr Bueso, akpm, dledford, jack, linux-rdma, linux-mm,
	linux-kernel, Dalessandro, Dennis, Marciniszyn, Mike,
	Davidlohr Bueso

On Wed 30-01-19 18:01:33, Weiny, Ira wrote:
> > 
> > On Tue, Jan 29, 2019 at 10:50:05AM -0800, Ira Weiny wrote:
> > > > .. and I'm looking at some of the other conversions here.. *most
> > > > likely* any caller that is manipulating rlimit for get_user_pages
> > > > should really be calling get_user_pages_longterm, so they should not
> > > > be converted to use _fast?
> > >
> > > Is this a question?  I'm not sure I understand the meaning here?
> > 
> > More an invitation to disprove the statement
> 
> Generally I agree.  But would be best if we could get fast GUP for
> performance.  I have not worked out if that will be possible with the
> final "longterm" solutions.

Initially probably not, longer-term it might be added if there are
performance data supporting that (i.e., showing real workload that would
benefit). In principle there's nothing that would prevent gup_fast like
functionality for long-term pins but I expect there will be always
additional overhead (compared to plain gup_fast()) of establishing
something like leases to identify long-term pins. But we haven't figured
out the details yet. For now we concentrate on fixing short-term pins and
issues with those.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter
  2019-02-06 17:59 [PATCH v3 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
@ 2019-02-06 17:59 ` Davidlohr Bueso
  0 siblings, 0 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2019-02-06 17:59 UTC (permalink / raw)
  To: jgg, akpm
  Cc: dledford, jgg, jack, willy, ira.weiny, linux-rdma, linux-mm,
	linux-kernel, dave, Davidlohr Bueso

Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem
and can simply some code around it for pinned_vm users. The counter
is 64-bit such that we need not worry about overflows such as rdma
user input controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/infiniband/core/umem.c             | 12 ++++++------
 drivers/infiniband/hw/hfi1/user_pages.c    |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c |  4 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  8 ++++----
 drivers/misc/mic/scif/scif_rma.c           |  6 +++---
 fs/proc/task_mmu.c                         |  2 +-
 include/linux/mm_types.h                   |  2 +-
 kernel/events/core.c                       |  8 ++++----
 kernel/fork.c                              |  2 +-
 mm/debug.c                                 |  5 +++--
 10 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 1efe0a74e06b..678abe1afcba 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -166,13 +166,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	down_write(&mm->mmap_sem);
-	if (check_add_overflow(mm->pinned_vm, npages, &new_pinned) ||
-	    (new_pinned > lock_limit && !capable(CAP_IPC_LOCK))) {
+	new_pinned = atomic64_read(&mm->pinned_vm) + npages;
+	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
 		up_write(&mm->mmap_sem);
 		ret = -ENOMEM;
 		goto out;
 	}
-	mm->pinned_vm = new_pinned;
+	atomic64_set(&mm->pinned_vm, new_pinned);
 	up_write(&mm->mmap_sem);
 
 	cur_base = addr & PAGE_MASK;
@@ -234,7 +234,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	__ib_umem_release(context->device, umem, 0);
 vma:
 	down_write(&mm->mmap_sem);
-	mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 out:
 	if (vma_list)
@@ -263,7 +263,7 @@ static void ib_umem_release_defer(struct work_struct *work)
 	struct ib_umem *umem = container_of(work, struct ib_umem, work);
 
 	down_write(&umem->owning_mm->mmap_sem);
-	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
 	up_write(&umem->owning_mm->mmap_sem);
 
 	__ib_umem_release_tail(umem);
@@ -302,7 +302,7 @@ void ib_umem_release(struct ib_umem *umem)
 	} else {
 		down_write(&umem->owning_mm->mmap_sem);
 	}
-	umem->owning_mm->pinned_vm -= ib_umem_num_pages(umem);
+	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
 	up_write(&umem->owning_mm->mmap_sem);
 
 	__ib_umem_release_tail(umem);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..40a6e434190f 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -92,7 +92,7 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm,
 	size = DIV_ROUND_UP(size, PAGE_SIZE);
 
 	down_read(&mm->mmap_sem);
-	pinned = mm->pinned_vm;
+	pinned = atomic64_read(&mm->pinned_vm);
 	up_read(&mm->mmap_sem);
 
 	/* First, check the absolute limit against all pinned pages. */
@@ -112,7 +112,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 		return ret;
 
 	down_write(&mm->mmap_sem);
-	mm->pinned_vm += ret;
+	atomic64_add(ret, &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 
 	return ret;
@@ -131,7 +131,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		down_write(&mm->mmap_sem);
-		mm->pinned_vm -= npages;
+		atomic64_sub(npages, &mm->pinned_vm);
 		up_write(&mm->mmap_sem);
 	}
 }
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 075f09fb7ce3..c6c81022d313 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -75,7 +75,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 			goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
+	atomic64_add(num_pages, &current->mm->pinned_vm);
 
 	ret = 0;
 	goto bail;
@@ -156,7 +156,7 @@ void qib_release_user_pages(struct page **p, size_t num_pages)
 	__qib_release_user_pages(p, num_pages, 1);
 
 	if (current->mm) {
-		current->mm->pinned_vm -= num_pages;
+		atomic64_sub(num_pages, &current->mm->pinned_vm);
 		up_write(&current->mm->mmap_sem);
 	}
 }
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index ce01a59fccc4..854436a2b437 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -129,7 +129,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	uiomr->owning_mm = mm = current->mm;
 	down_write(&mm->mmap_sem);
 
-	locked = npages + current->mm->pinned_vm;
+	locked = npages + atomic64_read(&current->mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
@@ -187,7 +187,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	if (ret < 0)
 		usnic_uiom_put_pages(chunk_list, 0);
 	else {
-		mm->pinned_vm = locked;
+		atomic64_set(&mm->pinned_vm, locked);
 		mmgrab(uiomr->owning_mm);
 	}
 
@@ -441,7 +441,7 @@ static void usnic_uiom_release_defer(struct work_struct *work)
 		container_of(work, struct usnic_uiom_reg, work);
 
 	down_write(&uiomr->owning_mm->mmap_sem);
-	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
+	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
 	up_write(&uiomr->owning_mm->mmap_sem);
 
 	__usnic_uiom_release_tail(uiomr);
@@ -469,7 +469,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr,
 	} else {
 		down_write(&uiomr->owning_mm->mmap_sem);
 	}
-	uiomr->owning_mm->pinned_vm -= usnic_uiom_num_pages(uiomr);
+	atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
 	up_write(&uiomr->owning_mm->mmap_sem);
 
 	__usnic_uiom_release_tail(uiomr);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 749321eb91ae..2448368f181e 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -285,7 +285,7 @@ __scif_dec_pinned_vm_lock(struct mm_struct *mm,
 	} else {
 		down_write(&mm->mmap_sem);
 	}
-	mm->pinned_vm -= nr_pages;
+	atomic64_sub(nr_pages, &mm->pinned_vm);
 	up_write(&mm->mmap_sem);
 	return 0;
 }
@@ -299,7 +299,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
 		return 0;
 
 	locked = nr_pages;
-	locked += mm->pinned_vm;
+	locked += atomic64_read(&mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
 		dev_err(scif_info.mdev.this_device,
@@ -307,7 +307,7 @@ static inline int __scif_check_inc_pinned_vm(struct mm_struct *mm,
 			locked, lock_limit);
 		return -ENOMEM;
 	}
-	mm->pinned_vm = locked;
+	atomic64_set(&mm->pinned_vm, locked);
 	return 0;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0ec9edab2f3..d2902962244d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
-	SEQ_PUT_DEC(" kB\nVmPin:\t", mm->pinned_vm);
+	SEQ_PUT_DEC(" kB\nVmPin:\t", atomic64_read(&mm->pinned_vm));
 	SEQ_PUT_DEC(" kB\nVmHWM:\t", hiwater_rss);
 	SEQ_PUT_DEC(" kB\nVmRSS:\t", total_rss);
 	SEQ_PUT_DEC(" kB\nRssAnon:\t", anon);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..acea2ea2d6c4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,7 +405,7 @@ struct mm_struct {
 
 		unsigned long total_vm;	   /* Total pages mapped */
 		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
-		unsigned long pinned_vm;   /* Refcount permanently increased */
+		atomic64_t    pinned_vm;   /* Refcount permanently increased */
 		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
 		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
 		unsigned long stack_vm;	   /* VM_STACK */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cd13a30f732..8df0b77a4687 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5459,7 +5459,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 
 		/* now it's safe to free the pages */
 		atomic_long_sub(rb->aux_nr_pages, &mmap_user->locked_vm);
-		vma->vm_mm->pinned_vm -= rb->aux_mmap_locked;
+		atomic64_sub(rb->aux_mmap_locked, &vma->vm_mm->pinned_vm);
 
 		/* this has to be the last one */
 		rb_free_aux(rb);
@@ -5532,7 +5532,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	 */
 
 	atomic_long_sub((size >> PAGE_SHIFT) + 1, &mmap_user->locked_vm);
-	vma->vm_mm->pinned_vm -= mmap_locked;
+	atomic64_sub(mmap_locked, &vma->vm_mm->pinned_vm);
 	free_uid(mmap_user);
 
 out_put:
@@ -5680,7 +5680,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
-	locked = vma->vm_mm->pinned_vm + extra;
+	locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra;
 
 	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
 		!capable(CAP_IPC_LOCK)) {
@@ -5721,7 +5721,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 unlock:
 	if (!ret) {
 		atomic_long_add(user_extra, &user->locked_vm);
-		vma->vm_mm->pinned_vm += extra;
+		atomic64_add(extra, &vma->vm_mm->pinned_vm);
 
 		atomic_inc(&event->mmap_count);
 	} else if (rb) {
diff --git a/kernel/fork.c b/kernel/fork.c
index a60459947f18..fe4a051a8d15 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,7 +980,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-	mm->pinned_vm = 0;
+	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
 	spin_lock_init(&mm->arg_lock);
diff --git a/mm/debug.c b/mm/debug.c
index 0abb987dad9b..7d13941a72f9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -135,7 +135,7 @@ void dump_mm(const struct mm_struct *mm)
 		"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
 		"pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count %d\n"
 		"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
-		"pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n"
+		"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
 		"start_code %lx end_code %lx start_data %lx end_data %lx\n"
 		"start_brk %lx brk %lx start_stack %lx\n"
 		"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
@@ -166,7 +166,8 @@ void dump_mm(const struct mm_struct *mm)
 		mm_pgtables_bytes(mm),
 		mm->map_count,
 		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
-		mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm,
+		atomic64_read(&mm->pinned_vm),
+		mm->data_vm, mm->exec_vm, mm->stack_vm,
 		mm->start_code, mm->end_code, mm->start_data, mm->end_data,
 		mm->start_brk, mm->brk, mm->start_stack,
 		mm->arg_start, mm->arg_end, mm->env_start, mm->env_end,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-02-06 18:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-21 17:42 [PATCH v2 -next 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
2019-01-21 17:42 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso
2019-01-21 21:51   ` Christopher Lameter
2019-01-22  9:56   ` Jan Kara
2019-01-22 15:45   ` Daniel Jordan
2019-01-23 18:33   ` Jason Gunthorpe
2019-01-28 21:10     ` Andrew Morton
2019-01-21 17:42 ` [PATCH 2/6] mic/scif: do not use mmap_sem Davidlohr Bueso
2019-01-21 17:42 ` [PATCH 3/6] drivers/IB,qib: " Davidlohr Bueso
2019-01-28 23:31   ` Jason Gunthorpe
2019-01-29  4:46     ` Jason Gunthorpe
2019-01-29 14:14       ` Davidlohr Bueso
2019-01-29 18:50       ` Ira Weiny
2019-01-29 23:19         ` Jason Gunthorpe
2019-01-30 18:01           ` Weiny, Ira
2019-01-31 10:04             ` Jan Kara
2019-01-21 17:42 ` [PATCH 4/6] drivers/IB,hfi1: do not se mmap_sem Davidlohr Bueso
2019-01-21 17:42 ` [PATCH 5/6] drivers/IB,usnic: reduce scope of mmap_sem Davidlohr Bueso
2019-01-21 17:42 ` [PATCH 6/6] drivers/IB,core: " Davidlohr Bueso
2019-01-21 18:32   ` Jason Gunthorpe
2019-01-21 19:12     ` Davidlohr Bueso
2019-01-21 21:53   ` Christopher Lameter
2019-02-06 17:59 [PATCH v3 0/6] mm: make pinned_vm atomic and simplify users Davidlohr Bueso
2019-02-06 17:59 ` [PATCH 1/6] mm: make mm->pinned_vm an atomic64 counter Davidlohr Bueso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).