linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/5] VM_PINNED
@ 2014-05-26 14:56 Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Roland Dreier, Sean Hefty, Hal Rosenstock,
	Mike Marciniszyn, Peter Zijlstra

Hi all,

I mentioned at LSF/MM that I wanted to revive this, and at the time there were
no disagreements.

I finally got around to refreshing the patch(es) so here goes.

These patches introduce the VM_PINNED infrastructure: vma tracking of persistent
'pinned' page ranges. Pinned is anything that has a fixed physical address (as
required by, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
popular way to pin pages is through get_user_pages(), but that is not
necessarily the only way.

Roland, as said, I need some IB assistance; see patches 4 and 5, where I got
lost in the qib and ipath code.

Patches 1-3 compile tested.



* [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
@ 2014-05-26 14:56 ` Peter Zijlstra
  2014-05-29  1:48   ` Rik van Riel
  2014-05-26 14:56 ` [RFC][PATCH 2/5] mm,perf: Make use of VM_PINNED Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Peter Zijlstra, Roland Dreier, Sean Hefty,
	Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: peterz-mm-pinned-1.patch --]

Introduce VM_PINNED and related machinery to solve a number of issues.

Firstly, various subsystems (perf and IB amongst others) 'pin'
significant chunks of memory (through holding page refs or custom
maps). Because this memory is unevictable, we must account it against
RLIMIT_MEMLOCK.

However, you can also mlock() these ranges, resulting in double
accounting. Patch bc3e53f682 ("mm: distinguish between mlocked and
pinned pages") split the counter into mm_struct::locked_vm and
mm_struct::pinned_vm, but did not include pinned_vm in the
RLIMIT_MEMLOCK test.

This resulted in RLIMIT_MEMLOCK under-accounting, effectively
allowing double the intended amount of memory to become unevictable.
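
Concretely (illustrative numbers): with RLIMIT_MEMLOCK set to 64MB, a task
could mlock() 64MB (locked_vm at the limit) and additionally register 64MB
of IB buffers (accounted only in the untested pinned_vm), ending up with
128MB of unevictable memory.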

By introducing VM_PINNED and keeping track of these ranges as VMAs, we
have sufficient information to account all pages without over- or
under-accounting any.

Secondly, the long-term pinning of pages gets things like CMA and
compaction into trouble, because these pages (esp. for IB) start their
life as normal movable pages but stop being movable once 'pinned'.
This results in CMA and compaction failures.

By having a single common function, mm_mpin(), called before the
get_user_pages() call, we can rectify this by migrating the pages to a
more suitable location -- this patch does not do that yet, but provides
the infrastructure to do so.
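
The TODOs in mm_mpin() below mark the spot; one possible shape, purely as
a sketch (migrate_out_of_movable() is a hypothetical helper, not an
existing API):

	int mm_mpin(unsigned long start, size_t len)
	{
		...
		/*
		 * Hypothetical: move the range out of CMA/MOVABLE page
		 * blocks before the pages become unmovable.
		 */
		err = migrate_out_of_movable(current->mm, start, len);
		if (err)
			return err;

		return do_mlock(start, len, MLOCK_F_ON | MLOCK_F_PIN);
	}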

Thirdly, because VM_LOCKED does allow unmapping (and therefore page
migration), the -rt people are not pleased and would very much like
something stronger. This provides the required infrastructure (but not
the user interfaces).

Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/mm.h       |    3 +
 include/linux/mm_types.h |    5 +
 kernel/fork.c            |    2 
 mm/mlock.c               |  133 ++++++++++++++++++++++++++++++++++++++++++-----
 mm/mmap.c                |   18 ++++--
 5 files changed, 141 insertions(+), 20 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -113,6 +113,7 @@ extern unsigned int kobjsize(const void
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 
+#define VM_PINNED	0x00001000
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
@@ -1808,6 +1809,8 @@ static inline void mm_populate(unsigned
 	/* Ignore errors */
 	(void) __mm_populate(addr, len, 1);
 }
+extern int mm_mpin(unsigned long start, size_t len);
+extern int mm_munpin(unsigned long start, size_t len);
 #else
 static inline void mm_populate(unsigned long addr, unsigned long len) {}
 #endif
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -469,6 +469,11 @@ static inline cpumask_t *mm_cpumask(stru
 	return mm->cpu_vm_mask_var;
 }
 
+static inline unsigned long mm_locked_pages(struct mm_struct *mm)
+{
+	return mm->pinned_vm + mm->locked_vm;
+}
+
 #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
 /*
  * Memory barriers to keep this state in sync are graciously provided by
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -410,6 +410,8 @@ static int dup_mmap(struct mm_struct *mm
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~VM_LOCKED;
+		if (tmp->vm_flags & VM_PINNED)
+			mm->pinned_vm += vma_pages(tmp);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -549,9 +549,8 @@ static int mlock_fixup(struct vm_area_st
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgoff_t pgoff;
-	int nr_pages;
+	int nr_pages, nr_locked, nr_pinned;
 	int ret = 0;
-	int lock = !!(newflags & VM_LOCKED);
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
@@ -582,9 +581,49 @@ static int mlock_fixup(struct vm_area_st
 	 * Keep track of amount of locked VM.
 	 */
 	nr_pages = (end - start) >> PAGE_SHIFT;
-	if (!lock)
-		nr_pages = -nr_pages;
-	mm->locked_vm += nr_pages;
+
+	/*
+	 * We should only account pages once, if VM_PINNED is set pages are
+	 * accounted in mm_struct::pinned_vm, otherwise if VM_LOCKED is set,
+	 * we'll account them in mm_struct::locked_vm.
+	 *
+	 * PL  := vma->vm_flags
+	 * PL' := newflags
+	 * PLd := {pinned,locked}_vm delta
+	 *
+	 * PL->PL' PLd
+	 * -----------
+	 * 00  01  0+
+	 * 00  10  +0
+	 * 01  11  +-
+	 * 01  00  0-
+	 * 10  00  -0
+	 * 10  11  00
+	 * 11  01  -+
+	 * 11  10  00
+	 */
+
+	nr_pinned = nr_locked = 0;
+
+	if ((vma->vm_flags ^ newflags) & VM_PINNED) {
+		if (vma->vm_flags & VM_PINNED)
+			nr_pinned = -nr_pages;
+		else
+			nr_pinned = nr_pages;
+	}
+
+	if (vma->vm_flags & VM_PINNED) {
+		if ((newflags & (VM_PINNED|VM_LOCKED)) == VM_LOCKED)
+			nr_locked = nr_pages;
+	} else {
+		if (vma->vm_flags & VM_LOCKED)
+			nr_locked = -nr_pages;
+		else if (newflags & VM_LOCKED)
+			nr_locked = nr_pages;
+	}
+
+	mm->pinned_vm += nr_pinned;
+	mm->locked_vm += nr_locked;
 
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
@@ -592,7 +631,7 @@ static int mlock_fixup(struct vm_area_st
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 
-	if (lock)
+	if (((vma->vm_flags ^ newflags) & VM_PINNED) || (newflags & VM_LOCKED))
 		vma->vm_flags = newflags;
 	else
 		munlock_vma_pages_range(vma, start, end);
@@ -602,12 +641,17 @@ static int mlock_fixup(struct vm_area_st
 	return ret;
 }
 
-static int do_mlock(unsigned long start, size_t len, int on)
+#define MLOCK_F_ON	0x01
+#define MLOCK_F_PIN	0x02
+
+static int do_mlock(unsigned long start, size_t len, unsigned int flags)
 {
 	unsigned long nstart, end, tmp;
 	struct vm_area_struct * vma, * prev;
 	int error;
 
+	lockdep_assert_held(&current->mm->mmap_sem);
+
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(len != PAGE_ALIGN(len));
 	end = start + len;
@@ -624,13 +668,18 @@ static int do_mlock(unsigned long start,
 		prev = vma;
 
 	for (nstart = start ; ; ) {
-		vm_flags_t newflags;
+		vm_flags_t newflags = vma->vm_flags;
+		vm_flags_t flag = VM_LOCKED;
 
-		/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
+		if (flags & MLOCK_F_PIN)
+			flag = VM_PINNED;
 
-		newflags = vma->vm_flags & ~VM_LOCKED;
-		if (on)
-			newflags |= VM_LOCKED;
+		if (flags & MLOCK_F_ON)
+			newflags |= flag;
+		else
+			newflags &= ~flag;
+
+		/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
 
 		tmp = vma->vm_end;
 		if (tmp > end)
@@ -653,6 +702,62 @@ static int do_mlock(unsigned long start,
 	return error;
 }
 
+/**
+ * mm_mpin - create a pinned vma
+ * @start: vaddr at which the vma starts
+ * @len: size of the vma
+ *
+ * Creates a pinned vma, where pinning is similar to mlock in that the pages
+ * will be unevictable, but stronger in that the pages will be unmappable as
+ * well. Typically this is called before a driver does get_user_pages() on a
+ * chunk of memory on behalf of a user.
+ *
+ * Returns 0 for success, otherwise:
+ * -EPERM  - the caller is not privileged
+ * -ENOMEM - the caller exceeded RLIMIT_MEMLOCK
+ * -ENOMEM - failed to allocate sufficient memory
+ */
+int mm_mpin(unsigned long start, size_t len)
+{
+	unsigned long locked, lock_limit;
+
+	if (!can_do_mlock())
+		return -EPERM;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK);
+	lock_limit >>= PAGE_SHIFT;
+	locked = len >> PAGE_SHIFT;
+	locked += mm_locked_pages(current->mm);
+
+	if (!((locked <= lock_limit) || capable(CAP_IPC_LOCK)))
+		return -ENOMEM;
+
+	/*
+	 * Because we're typically called before a long-term get_user_pages()
+	 * call, this is a good spot to avoid eviction related problems:
+	 *
+	 * TODO; migrate all these pages out from CMA regions.
+	 * TODO; migrate the pages to UNMOVABLE page blocks.
+	 * TODO; linearize these pages to avoid compaction issues.
+	 */
+	return do_mlock(start, len, MLOCK_F_ON | MLOCK_F_PIN);
+}
+EXPORT_SYMBOL_GPL(mm_mpin);
+
+/**
+ * mm_munpin - destroy a pinned vma
+ * @start: vaddr of the vma start
+ * @len: size of the vma
+ *
+ * Undoes mm_mpin().
+ */
+int mm_munpin(unsigned long start, size_t len)
+{
+	return do_mlock(start, len, MLOCK_F_PIN);
+}
+EXPORT_SYMBOL_GPL(mm_munpin);
+
 /*
  * __mm_populate - populate and/or mlock pages within a range of address space.
  *
@@ -736,11 +841,11 @@ SYSCALL_DEFINE2(mlock, unsigned long, st
 
 	down_write(&current->mm->mmap_sem);
 
-	locked += current->mm->locked_vm;
+	locked += mm_locked_pages(current->mm);
 
 	/* check against resource limits */
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
-		error = do_mlock(start, len, 1);
+		error = do_mlock(start, len, MLOCK_F_ON);
 
 	up_write(&current->mm->mmap_sem);
 	if (!error)
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1210,7 +1210,7 @@ static inline int mlock_future_check(str
 	/*  mlock MCL_FUTURE? */
 	if (flags & VM_LOCKED) {
 		locked = len >> PAGE_SHIFT;
-		locked += mm->locked_vm;
+		locked += mm_locked_pages(mm);
 		lock_limit = rlimit(RLIMIT_MEMLOCK);
 		lock_limit >>= PAGE_SHIFT;
 		if (locked > lock_limit && !capable(CAP_IPC_LOCK))
@@ -1616,7 +1616,9 @@ unsigned long mmap_region(struct file *f
 	perf_event_mmap(vma);
 
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
-	if (vm_flags & VM_LOCKED) {
+	if (vm_flags & VM_PINNED) {
+		mm->pinned_vm += (len >> PAGE_SHIFT);
+	} else if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
@@ -2069,7 +2071,7 @@ static int acct_stack_growth(struct vm_a
 	if (vma->vm_flags & VM_LOCKED) {
 		unsigned long locked;
 		unsigned long limit;
-		locked = mm->locked_vm + grow;
+		locked = mm_locked_pages(mm) + grow;
 		limit = ACCESS_ONCE(rlim[RLIMIT_MEMLOCK].rlim_cur);
 		limit >>= PAGE_SHIFT;
 		if (locked > limit && !capable(CAP_IPC_LOCK))
@@ -2538,13 +2540,17 @@ int do_munmap(struct mm_struct *mm, unsi
 	/*
 	 * unlock any mlock()ed ranges before detaching vmas
 	 */
-	if (mm->locked_vm) {
+	if (mm->locked_vm || mm->pinned_vm) {
 		struct vm_area_struct *tmp = vma;
 		while (tmp && tmp->vm_start < end) {
-			if (tmp->vm_flags & VM_LOCKED) {
+			if (tmp->vm_flags & VM_PINNED)
+				mm->pinned_vm -= vma_pages(tmp);
+			else if (tmp->vm_flags & VM_LOCKED)
 				mm->locked_vm -= vma_pages(tmp);
+
+			if (tmp->vm_flags & VM_LOCKED)
 				munlock_vma_pages_all(tmp);
-			}
+
 			tmp = tmp->vm_next;
 		}
 	}




* [RFC][PATCH 2/5] mm,perf: Make use of VM_PINNED
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces Peter Zijlstra
@ 2014-05-26 14:56 ` Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 3/5] mm,ib,umem: Use VM_PINNED Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Thomas Gleixner, Peter Zijlstra, Sean Hefty,
	Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: peterz-mm-pinned-2.patch --]

Change the perf RLIMIT_MEMLOCK accounting to use VM_PINNED. Because of
the way VM_PINNED works (it hard-assumes the entire vma length is
accounted), we have to slightly change the semantics.

We used to add to the RLIMIT_MEMLOCK accounting only once we were over
the per-user limit; now we'll directly account to both.
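
For example (illustrative numbers): with a per-user allowance of 512
pages, an mmap() of 515 data pages plus the control page (516 pages in
all) used to put all 516 pages into user->locked_vm and only the 4
overflow pages into mm->pinned_vm for the RLIMIT_MEMLOCK test. With this
patch the full 516 pages are charged to user->locked_vm and, via the
VM_PINNED vma, to mm->pinned_vm, and -EPERM is only returned when both
the per-user and the RLIMIT_MEMLOCK tests fail.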

XXX: anon_inode_inode->i_mapping doesn't have AS_UNEVICTABLE set,
should it?

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roland Dreier <roland@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/events/core.c |   36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4059,13 +4059,12 @@ static const struct vm_operations_struct
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
+	unsigned long locked, lock_limit, lock_extra;
 	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
-	unsigned long locked, lock_limit;
-	struct ring_buffer *rb;
 	unsigned long vma_size;
 	unsigned long nr_pages;
-	long user_extra, extra;
+	struct ring_buffer *rb;
 	int ret = 0, flags = 0;
 
 	/*
@@ -4117,26 +4116,22 @@ static int perf_mmap(struct file *file,
 		goto unlock;
 	}
 
-	user_extra = nr_pages + 1;
-	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
+	lock_extra = nr_pages + 1;
 
 	/*
 	 * Increase the limit linearly with more CPUs:
 	 */
+	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
 	user_lock_limit *= num_online_cpus();
 
-	user_locked = atomic_long_read(&user->locked_vm) + user_extra;
-
-	extra = 0;
-	if (user_locked > user_lock_limit)
-		extra = user_locked - user_lock_limit;
+	user_locked = atomic_long_read(&user->locked_vm) + lock_extra;
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
-	locked = vma->vm_mm->pinned_vm + extra;
+	locked = mm_locked_pages(vma->vm_mm) + lock_extra;
 
-	if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
-		!capable(CAP_IPC_LOCK)) {
+	if ((user_locked > user_lock_limit && locked > lock_limit) &&
+	    perf_paranoid_tracepoint_raw() && !capable(CAP_IPC_LOCK)) {
 		ret = -EPERM;
 		goto unlock;
 	}
@@ -4146,7 +4141,7 @@ static int perf_mmap(struct file *file,
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
+	rb = rb_alloc(nr_pages,
 		event->attr.watermark ? event->attr.wakeup_watermark : 0,
 		event->cpu, flags);
 
@@ -4156,11 +4151,9 @@ static int perf_mmap(struct file *file,
 	}
 
 	atomic_set(&rb->mmap_count, 1);
-	rb->mmap_locked = extra;
 	rb->mmap_user = get_current_user();
 
-	atomic_long_add(user_extra, &user->locked_vm);
-	vma->vm_mm->pinned_vm += extra;
+	atomic_long_add(lock_extra, &user->locked_vm);
 
 	ring_buffer_attach(event, rb);
 
@@ -4173,10 +4166,13 @@ static int perf_mmap(struct file *file,
 	mutex_unlock(&event->mmap_mutex);
 
 	/*
-	 * Since pinned accounting is per vm we cannot allow fork() to copy our
-	 * vma.
+	 * VM_PINNED - this memory is pinned as we need to write to it from
+	 *             pretty much any context and cannot page.
+	 * VM_DONTCOPY - don't share over fork()
+	 * VM_DONTEXPAND - it's not stack
+	 * VM_DONTDUMP - ...
 	 */
-	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags |= VM_PINNED | VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
 	vma->vm_ops = &perf_mmap_vmops;
 
 	return ret;




* [RFC][PATCH 3/5] mm,ib,umem: Use VM_PINNED
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 2/5] mm,perf: Make use of VM_PINNED Peter Zijlstra
@ 2014-05-26 14:56 ` Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 4/5] mm,ib,ipath: " Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Peter Zijlstra, Roland Dreier, Sean Hefty,
	Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: peterz-mm-pinned-3.patch --]

Use the mm_mpin() call to prepare the vm for a 'persistent'
get_user_pages() call.
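
In outline, the resulting ib_umem_get() flow and its error unwinding
(labels as in the diff below) become:

	down_write(&current->mm->mmap_sem);
	ret = mm_mpin(umem->start_addr, npages * PAGE_SIZE);
	if (ret)
		goto err;		/* kfree(umem) */
	ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
	if (ret)
		goto err_unpin;		/* mm_munpin() + kfree() */
	while (npages) {
		ret = get_user_pages(...);
		if (ret < 0)
			goto err_release; /* __ib_umem_release() + unpin + kfree() */
		...
	}
unlock:
	up_write(&current->mm->mmap_sem);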

Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/infiniband/core/umem.c |   51 ++++++++++++++++-------------------------
 include/rdma/ib_umem.h         |    3 +-
 2 files changed, 23 insertions(+), 31 deletions(-)

--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -81,15 +81,12 @@ struct ib_umem *ib_umem_get(struct ib_uc
 	struct ib_umem *umem;
 	struct page **page_list;
 	struct vm_area_struct **vma_list;
-	unsigned long locked;
-	unsigned long lock_limit;
 	unsigned long cur_base;
 	unsigned long npages;
 	int ret;
 	int i;
 	DEFINE_DMA_ATTRS(attrs);
 	struct scatterlist *sg, *sg_list_start;
-	int need_release = 0;
 
 	if (dmasync)
 		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
@@ -135,26 +132,23 @@ struct ib_umem *ib_umem_get(struct ib_uc
 
 	down_write(&current->mm->mmap_sem);
 
-	locked     = npages + current->mm->pinned_vm;
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
 	cur_base = addr & PAGE_MASK;
+	umem->start_addr = cur_base;
+	umem->nr_pages = npages;
 
 	if (npages == 0) {
 		ret = -EINVAL;
-		goto out;
+		goto err;
 	}
 
+	ret = mm_mpin(umem->start_addr, npages * PAGE_SIZE);
+	if (ret)
+		goto err;
+
 	ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
 	if (ret)
-		goto out;
+		goto err_unpin;
 
-	need_release = 1;
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
@@ -164,7 +158,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
 				     1, !umem->writable, page_list, vma_list);
 
 		if (ret < 0)
-			goto out;
+			goto err_release;
 
 		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
@@ -189,25 +183,26 @@ struct ib_umem *ib_umem_get(struct ib_uc
 
 	if (umem->nmap <= 0) {
 		ret = -ENOMEM;
-		goto out;
+		goto err_release;
 	}
 
 	ret = 0;
 
-out:
-	if (ret < 0) {
-		if (need_release)
-			__ib_umem_release(context->device, umem, 0);
-		kfree(umem);
-	} else
-		current->mm->pinned_vm = locked;
-
+unlock:
 	up_write(&current->mm->mmap_sem);
 	if (vma_list)
 		free_page((unsigned long) vma_list);
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+
+err_release:
+	__ib_umem_release(context->device, umem, 0);
+err_unpin:
+	mm_munpin(umem->start_addr, umem->nr_pages * PAGE_SIZE);
+err:
+	kfree(umem);
+	goto unlock;
 }
 EXPORT_SYMBOL(ib_umem_get);
 
@@ -216,7 +211,7 @@ static void ib_umem_account(struct work_
 	struct ib_umem *umem = container_of(work, struct ib_umem, work);
 
 	down_write(&umem->mm->mmap_sem);
-	umem->mm->pinned_vm -= umem->diff;
+	mm_munpin(umem->start_addr, umem->nr_pages * PAGE_SIZE);
 	up_write(&umem->mm->mmap_sem);
 	mmput(umem->mm);
 	kfree(umem);
@@ -230,7 +225,6 @@ void ib_umem_release(struct ib_umem *ume
 {
 	struct ib_ucontext *context = umem->context;
 	struct mm_struct *mm;
-	unsigned long diff;
 
 	__ib_umem_release(umem->context->device, umem, 1);
 
@@ -240,8 +234,6 @@ void ib_umem_release(struct ib_umem *ume
 		return;
 	}
 
-	diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
-
 	/*
 	 * We may be called with the mm's mmap_sem already held.  This
 	 * can happen when a userspace munmap() is the call that drops
@@ -254,7 +246,6 @@ void ib_umem_release(struct ib_umem *ume
 		if (!down_write_trylock(&mm->mmap_sem)) {
 			INIT_WORK(&umem->work, ib_umem_account);
 			umem->mm   = mm;
-			umem->diff = diff;
 
 			queue_work(ib_wq, &umem->work);
 			return;
@@ -262,7 +253,7 @@ void ib_umem_release(struct ib_umem *ume
 	} else
 		down_write(&mm->mmap_sem);
 
-	current->mm->pinned_vm -= diff;
+	mm_munpin(umem->start_addr, umem->nr_pages * PAGE_SIZE);
 	up_write(&mm->mmap_sem);
 	mmput(mm);
 	kfree(umem);
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -41,6 +41,8 @@ struct ib_ucontext;
 
 struct ib_umem {
 	struct ib_ucontext     *context;
+	unsigned long		start_addr;
+	unsigned long		nr_pages;
 	size_t			length;
 	int			offset;
 	int			page_size;
@@ -48,7 +50,6 @@ struct ib_umem {
 	int                     hugetlb;
 	struct work_struct	work;
 	struct mm_struct       *mm;
-	unsigned long		diff;
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;




* [RFC][PATCH 4/5] mm,ib,ipath: Use VM_PINNED
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-05-26 14:56 ` [RFC][PATCH 3/5] mm,ib,umem: Use VM_PINNED Peter Zijlstra
@ 2014-05-26 14:56 ` Peter Zijlstra
  2014-05-26 14:56 ` [RFC][PATCH 5/5] mm,ib,qib: " Peter Zijlstra
  2014-05-26 20:19 ` [RFC][PATCH 0/5] VM_PINNED Konstantin Khlebnikov
  5 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Peter Zijlstra, Roland Dreier, Sean Hefty,
	Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: peterz-mm-pinned-4.patch --]

XXX: got lost, need the start vaddr in the release paths, help?

Use the mm_mpin() call to prepare the vm for a 'persistent'
get_user_pages() call.

Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Marciniszyn <infinipath@intel.com> 
Cc: Roland Dreier <roland@kernel.org> 
Cc: Sean Hefty <sean.hefty@intel.com> 
Cc: Hal Rosenstock <hal.rosenstock@gmail.com> 
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/infiniband/hw/ipath/ipath_file_ops.c   |    6 ++---
 drivers/infiniband/hw/ipath/ipath_kernel.h     |    4 +--
 drivers/infiniband/hw/ipath/ipath_user_pages.c |   28 +++++++++++--------------
 3 files changed, 18 insertions(+), 20 deletions(-)

--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -456,7 +456,7 @@ static int ipath_tid_update(struct ipath
 				ipath_stats.sps_pageunlocks++;
 			}
 		}
-		ipath_release_user_pages(pagep, cnt);
+		ipath_release_user_pages(pagep, /* vaddr */, cnt);
 	} else {
 		/*
 		 * Copy the updated array, with ipath_tid's filled in, back
@@ -572,7 +572,7 @@ static int ipath_tid_free(struct ipath_p
 			pci_unmap_page(dd->pcidev,
 				dd->ipath_physshadow[porttid + tid],
 				PAGE_SIZE, PCI_DMA_FROMDEVICE);
-			ipath_release_user_pages(&p, 1);
+			ipath_release_user_pages(&p, /* vaddr */, 1);
 			ipath_stats.sps_pageunlocks++;
 		} else
 			ipath_dbg("Unused tid %u, ignoring\n", tid);
@@ -2025,7 +2025,7 @@ static void unlock_expected_tids(struct
 		dd->ipath_pageshadow[i] = NULL;
 		pci_unmap_page(dd->pcidev, dd->ipath_physshadow[i],
 			PAGE_SIZE, PCI_DMA_FROMDEVICE);
-		ipath_release_user_pages_on_close(&ps, 1);
+		ipath_release_user_pages_on_close(&ps, /* vaddr */, 1);
 		cnt++;
 		ipath_stats.sps_pageunlocks++;
 	}
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -1082,8 +1082,8 @@ static inline void ipath_sdma_desc_unres
 #define IPATH_DFLT_RCVHDRSIZE 9
 
 int ipath_get_user_pages(unsigned long, size_t, struct page **);
-void ipath_release_user_pages(struct page **, size_t);
-void ipath_release_user_pages_on_close(struct page **, size_t);
+void ipath_release_user_pages(struct page **, unsigned long, size_t);
+void ipath_release_user_pages_on_close(struct page **, unsigned long, size_t);
 int ipath_eeprom_read(struct ipath_devdata *, u8, void *, int);
 int ipath_eeprom_write(struct ipath_devdata *, u8, const void *, int);
 int ipath_tempsense_read(struct ipath_devdata *, u8 regnum);
--- a/drivers/infiniband/hw/ipath/ipath_user_pages.c
+++ b/drivers/infiniband/hw/ipath/ipath_user_pages.c
@@ -39,7 +39,7 @@
 #include "ipath_kernel.h"
 
 static void __ipath_release_user_pages(struct page **p, size_t num_pages,
-				   int dirty)
+				       int dirty)
 {
 	size_t i;
 
@@ -56,16 +56,12 @@ static void __ipath_release_user_pages(s
 static int __ipath_get_user_pages(unsigned long start_page, size_t num_pages,
 				  struct page **p, struct vm_area_struct **vma)
 {
-	unsigned long lock_limit;
 	size_t got;
 	int ret;
 
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if (num_pages > lock_limit) {
-		ret = -ENOMEM;
+	ret = mm_mpin(start_page, num_pages * PAGE_SIZE);
+	if (ret)
 		goto bail;
-	}
 
 	ipath_cdbg(VERBOSE, "pin %lx pages from vaddr %lx\n",
 		   (unsigned long) num_pages, start_page);
@@ -78,14 +74,12 @@ static int __ipath_get_user_pages(unsign
 		if (ret < 0)
 			goto bail_release;
 	}
-
-	current->mm->pinned_vm += num_pages;
-
 	ret = 0;
 	goto bail;
 
 bail_release:
 	__ipath_release_user_pages(p, got, 0);
+	mm_munpin(start_page, num_pages * PAGE_SIZE);
 bail:
 	return ret;
 }
@@ -172,13 +166,13 @@ int ipath_get_user_pages(unsigned long s
 	return ret;
 }
 
-void ipath_release_user_pages(struct page **p, size_t num_pages)
+void ipath_release_user_pages(struct page **p, unsigned long start_page,
+			      size_t num_pages)
 {
 	down_write(&current->mm->mmap_sem);
 
 	__ipath_release_user_pages(p, num_pages, 1);
-
-	current->mm->pinned_vm -= num_pages;
+	mm_munpin(start_page, num_pages * PAGE_SIZE);
 
 	up_write(&current->mm->mmap_sem);
 }
@@ -186,6 +180,7 @@ void ipath_release_user_pages(struct pag
 struct ipath_user_pages_work {
 	struct work_struct work;
 	struct mm_struct *mm;
+	unsigned long start_page;
 	unsigned long num_pages;
 };
 
@@ -195,13 +190,15 @@ static void user_pages_account(struct wo
 		container_of(_work, struct ipath_user_pages_work, work);
 
 	down_write(&work->mm->mmap_sem);
-	work->mm->pinned_vm -= work->num_pages;
+	mm_munpin(work->start_page, work->num_pages * PAGE_SIZE);
 	up_write(&work->mm->mmap_sem);
 	mmput(work->mm);
 	kfree(work);
 }
 
-void ipath_release_user_pages_on_close(struct page **p, size_t num_pages)
+void ipath_release_user_pages_on_close(struct page **p,
+				       unsigned long start_page,
+				       size_t num_pages)
 {
 	struct ipath_user_pages_work *work;
 	struct mm_struct *mm;
@@ -218,6 +215,7 @@ void ipath_release_user_pages_on_close(s
 
 	INIT_WORK(&work->work, user_pages_account);
 	work->mm = mm;
+	work->start_page = start_page;
 	work->num_pages = num_pages;
 
 	queue_work(ib_wq, &work->work);




* [RFC][PATCH 5/5] mm,ib,qib: Use VM_PINNED
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-05-26 14:56 ` [RFC][PATCH 4/5] mm,ib,ipath: " Peter Zijlstra
@ 2014-05-26 14:56 ` Peter Zijlstra
  2014-05-26 20:19 ` [RFC][PATCH 0/5] VM_PINNED Konstantin Khlebnikov
  5 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 14:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Mike Marciniszyn, Roland Dreier, Sean Hefty,
	Hal Rosenstock, Peter Zijlstra

[-- Attachment #1: peterz-mm-pinned-5.patch --]

XXX: got lost, need the start vaddr in the release paths, help?

Use the mm_mpin() call to prepare the vm for a 'persistent'
get_user_pages() call.

Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Marciniszyn <infinipath@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 drivers/infiniband/hw/qib/qib.h            |    2 +-
 drivers/infiniband/hw/qib/qib_file_ops.c   |    6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c |   16 ++++++----------
 3 files changed, 10 insertions(+), 14 deletions(-)

--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -1376,7 +1376,7 @@ void qib_sdma_process_event(struct qib_p
 #define QIB_RCVHDR_ENTSIZE 32
 
 int qib_get_user_pages(unsigned long, size_t, struct page **);
-void qib_release_user_pages(struct page **, size_t);
+void qib_release_user_pages(struct page **, unsigned long, size_t);
 int qib_eeprom_read(struct qib_devdata *, u8, void *, int);
 int qib_eeprom_write(struct qib_devdata *, u8, const void *, int);
 u32 __iomem *qib_getsendbuf_range(struct qib_devdata *, u32 *, u32, u32);
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -423,7 +423,7 @@ static int qib_tid_update(struct qib_ctx
 				dd->pageshadow[ctxttid + tid] = NULL;
 			}
 		}
-		qib_release_user_pages(pagep, cnt);
+		qib_release_user_pages(pagep, /* vaddr */, cnt);
 	} else {
 		/*
 		 * Copy the updated array, with qib_tid's filled in, back
@@ -535,7 +535,7 @@ static int qib_tid_free(struct qib_ctxtd
 				      RCVHQ_RCV_TYPE_EXPECTED, dd->tidinvalid);
 			pci_unmap_page(dd->pcidev, phys, PAGE_SIZE,
 				       PCI_DMA_FROMDEVICE);
-			qib_release_user_pages(&p, 1);
+			qib_release_user_pages(&p, /* vaddr */, 1);
 		}
 	}
 done:
@@ -1796,7 +1796,7 @@ static void unlock_expected_tids(struct
 		dd->pageshadow[i] = NULL;
 		pci_unmap_page(dd->pcidev, phys, PAGE_SIZE,
 			       PCI_DMA_FROMDEVICE);
-		qib_release_user_pages(&p, 1);
+		qib_release_user_pages(&p, /* vaddr */, 1);
 		cnt++;
 	}
 }
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -54,16 +54,12 @@ static void __qib_release_user_pages(str
 static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 				struct page **p, struct vm_area_struct **vma)
 {
-	unsigned long lock_limit;
 	size_t got;
 	int ret;
 
-	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
-		ret = -ENOMEM;
+	ret = mm_mpin(start_page, num_pages * PAGE_SIZE);
+	if (ret)
 		goto bail;
-	}
 
 	for (got = 0; got < num_pages; got += ret) {
 		ret = get_user_pages(current, current->mm,
@@ -74,13 +70,12 @@ static int __qib_get_user_pages(unsigned
 			goto bail_release;
 	}
 
-	current->mm->pinned_vm += num_pages;
-
 	ret = 0;
 	goto bail;
 
 bail_release:
 	__qib_release_user_pages(p, got, 0);
+	mm_munpin(start_page, num_pages * PAGE_SIZE);
 bail:
 	return ret;
 }
@@ -143,15 +138,16 @@ int qib_get_user_pages(unsigned long sta
 	return ret;
 }
 
-void qib_release_user_pages(struct page **p, size_t num_pages)
+void qib_release_user_pages(struct page **p, unsigned long start_page, size_t num_pages)
 {
 	if (current->mm) /* during close after signal, mm can be NULL */
 		down_write(&current->mm->mmap_sem);
 
 	__qib_release_user_pages(p, num_pages, 1);
 
 	if (current->mm) {
-		current->mm->pinned_vm -= num_pages;
+		mm_munpin(start_page, num_pages * PAGE_SIZE);
 		up_write(&current->mm->mmap_sem);
 	}
 }




* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-05-26 14:56 ` [RFC][PATCH 5/5] mm,ib,qib: " Peter Zijlstra
@ 2014-05-26 20:19 ` Konstantin Khlebnikov
  2014-05-26 20:32   ` Peter Zijlstra
  5 siblings, 1 reply; 28+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-26 20:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Hi all,
>
> I mentioned at LSF/MM that I wanted to revive this, and at the time there were
> no disagreements.
>
> I finally got around to refreshing the patch(es) so here goes.
>
> These patches introduce the VM_PINNED infrastructure: vma tracking of persistent
> 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
> required by, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
> popular way to pin pages is through get_user_pages(), but that is not
> necessarily the only way.

Lol, this looks like a resurrection of VM_RESERVED, which I removed
not so long ago.

Maybe a single-bit state isn't flexible enough?
Is this supposed to support pinning only by one user, and only in its own mm?

This might be done as an extension of the existing memory-policy engine.
It allows keeping vm_area_struct slim in normal cases and changing
behaviour when needed.
A memory policy might hold a reference counter of "pinners", track
ownership and so on.

>
> Roland, as said, I need some IB assistance; see patches 4 and 5, where I got
> lost in the qib and ipath code.
>
> Patches 1-3 compile tested.


* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-26 20:19 ` [RFC][PATCH 0/5] VM_PINNED Konstantin Khlebnikov
@ 2014-05-26 20:32   ` Peter Zijlstra
  2014-05-26 20:49     ` Konstantin Khlebnikov
  2014-08-01 10:16     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-26 20:32 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 12:19:16AM +0400, Konstantin Khlebnikov wrote:
> On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Hi all,
> >
> > I mentioned at LSF/MM that I wanted to revive this, and at the time there were
> > no disagreements.
> >
> > I finally got around to refreshing the patch(es) so here goes.
> >
> > These patches introduce the VM_PINNED infrastructure: vma tracking of persistent
> > 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
> > required by, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
> > popular way to pin pages is through get_user_pages(), but that is not
> > necessarily the only way.
> 
> Lol, this looks like a resurrection of VM_RESERVED, which I removed
> not so long ago.

Not sure what VM_RESERVED did, but there might be a similarity.

> Maybe a single-bit state isn't flexible enough?

Not sure what you mean, the one bit is perfectly fine for what I want it
to do.

> Is this supposed to support pinning only by one user, and only in its own mm?

Pretty much, that's adequate for all users I'm aware of and mirrors the
mlock semantics.

> This might be done as an extension of the existing memory-policy engine.
> It allows keeping vm_area_struct slim in normal cases and changing
> behaviour when needed.
> A memory policy might hold a reference counter of "pinners", track
> ownership and so on.

That all sounds like raping the mempolicy code and massive
over-engineering.



* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-26 20:32   ` Peter Zijlstra
@ 2014-05-26 20:49     ` Konstantin Khlebnikov
  2014-05-27 10:29       ` Peter Zijlstra
  2014-08-01 10:16     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 28+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-26 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 12:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 12:19:16AM +0400, Konstantin Khlebnikov wrote:
>> On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Hi all,
>> >
>> > I mentioned at LSF/MM that I wanted to revive this, and at the time there were
>> > no disagreements.
>> >
>> > I finally got around to refreshing the patch(es) so here goes.
>> >
>> > These patches introduce the VM_PINNED infrastructure: vma tracking of persistent
>> > 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
>> > required by, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
>> > popular way to pin pages is through get_user_pages(), but that is not
>> > necessarily the only way.
>>
>> Lol, this looks like a resurrection of VM_RESERVED, which I removed
>> not so long ago.

That was swap-out prevention from 2.4 era, something between VM_IO and
VM_LOCKED with various side effects.

>
> Not sure what VM_RESERVED did, but there might be a similarity.
>
>> Maybe a single-bit state isn't flexible enough?
>
> Not sure what you mean, the one bit is perfectly fine for what I want it
> to do.
>
>> Is this supposed to support pinning only by one user, and only in its own mm?
>
> Pretty much, that's adequate for all users I'm aware of and mirrors the
> mlock semantics.

Ok, fine. Because get_user_pages() is sometimes used for pinning pages
from a different mm.

Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
its functionality.
Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
This will make accounting easier. No?

>
>> This might be done as an extension of the existing memory-policy engine.
>> It allows keeping vm_area_struct slim in normal cases and changing
>> behaviour when needed.
>> A memory policy might hold a reference counter of "pinners", track
>> ownership and so on.
>
> That all sounds like raping the mempolicy code and massive
> over-engineering.
>


* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-26 20:49     ` Konstantin Khlebnikov
@ 2014-05-27 10:29       ` Peter Zijlstra
  2014-05-27 10:54         ` Peter Zijlstra
  2014-05-27 14:34         ` Christoph Lameter
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 10:29 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn


On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 12:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Pretty much, that's adequate for all users I'm aware of and mirrors the
> > mlock semantics.
> 
> Ok, fine. Because get_user_pages() is sometimes used for pinning pages
> from a different mm.

Yeah, but that's fairly uncommon, and not something we do for very long
periods, afaik.

In fact I could only find:

  drivers/iommu/amd_iommu_v2.c

  fs/exec.c -- temporary use
  kernel/events/uprobes.c -- temporary use
  mm/ksm.c -- temporary use
  mm/process_vm_access.c -- temporary use

With the exception of the iommu one (it wasn't immediately obvious and I
didn't want to stare at the iommu muck too long), they're all temporary;
we drop the page almost immediately again after doing some short work.

The things I care about for VM_PINNED are long term pins, like the IB
stuff, which sets up its RDMA buffers at the start of a program and
basically leaves them in place for the entire duration of said program.

Such pins will disrupt CMA, compaction and pretty much anything that
relies on the page blocks stuff.

> Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> its functionality.
> Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> This will make accounting easier. No?

I prefer the PINNED name because not being able to migrate is only
one of the desired effects of it, not the primary effect. We're really
looking to keep physical pages in place and preserve mappings.

The -rt people for example really want to avoid faults (even minor
faults), and DONTMIGRATE would still allow unmapping.

Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
hadn't considered that. The first thing that came to mind is that that
might make the fork() semantics difficult, but maybe it works out.

And while we're on the subject, my patch preserves PINNED over fork()
but maybe we don't actually need that either.



* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 10:29       ` Peter Zijlstra
@ 2014-05-27 10:54         ` Peter Zijlstra
  2014-05-27 11:11           ` Konstantin Khlebnikov
  2014-05-27 14:34         ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 10:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn


On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> > its functionality.
> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> > This will make accounting easier. No?
> 
> I prefer the PINNED name because not being able to migrate is only
> one of the desired effects of it, not the primary effect. We're really
> looking to keep physical pages in place and preserve mappings.
> 
> The -rt people for example really want to avoid faults (even minor
> faults), and DONTMIGRATE would still allow unmapping.
> 
> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
> hadn't considered that. The first thing that came to mind is that that
> might make the fork() semantics difficult, but maybe it works out.
> 
> And while we're on the subject, my patch preserves PINNED over fork()
> but maybe we don't actually need that either.

So pinned_vm is userspace exposed, which means we have to maintain the
individual counts, and doing the fully orthogonal accounting is 'easier'
than trying to get the boundary cases right.

That is, if we have a program that does mlockall() and then does the IB
ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
after it splits the vma, and then do the pinned accounting.

Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
we don't know what to restore the vma to on mm_munpin().

So while the accounting looks tricky, it has simpler semantics.
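
Worked through the PL table from patch 1: an mlocked vma (PL=01) that
gets pinned moves to PL=11, i.e. pinned_vm += npages and locked_vm -=
npages (the '+-' row), and mm_munpin() moves it back to 01 (the '-+'
row), so the LOCKED state survives a pin/unpin cycle and no page is ever
counted twice.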



* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 10:54         ` Peter Zijlstra
@ 2014-05-27 11:11           ` Konstantin Khlebnikov
  2014-05-27 11:50             ` Vlastimil Babka
  2014-05-27 13:05             ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-27 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
>> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
>> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
>> > its functionality.
>> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
>> > This will make accounting easier. No?
>>
>> I prefer the PINNED name because not being able to migrate is only
>> one of the desired effects of it, not the primary effect. We're really
>> looking to keep physical pages in place and preserve mappings.

Ah, I just mixed it up.

>>
>> The -rt people for example really want to avoid faults (even minor
>> faults), and DONTMIGRATE would still allow unmapping.
>>
>> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
>> hadn't considered that. The first thing that came to mind is that that
>> might make the fork() semantics difficult, but maybe it works out.
>>
>> And while we're on the subject, my patch preserves PINNED over fork()
>> but maybe we don't actually need that either.
>
> So pinned_vm is userspace exposed, which means we have to maintain the
> individual counts, and doing the fully orthogonal accounting is 'easier'
> than trying to get the boundary cases right.
>
> That is, if we have a program that does mlockall() and then does the IB
> ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
> after it splits the vma, and then do the pinned accounting.
>
> Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
> we don't know what to restore the vma to on mm_munpin().
>
> So while the accounting looks tricky, it has simpler semantics.

What if VM_PINNED required VM_LOCKED?
I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.


* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 11:11           ` Konstantin Khlebnikov
@ 2014-05-27 11:50             ` Vlastimil Babka
  2014-05-27 13:09               ` Peter Zijlstra
  2014-05-27 13:05             ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2014-05-27 11:50 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Peter Zijlstra
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On 05/27/2014 01:11 PM, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
>>> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
>>>> Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
>>>> its functionality.
>>>> Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
>>>> This will make accounting easier. No?
>>>
>>> I prefer the PINNED name because not being able to migrate is only
>>> one of the desired effects of it, not the primary effect. We're really
>>> looking to keep physical pages in place and preserve mappings.
> 
> Ah, I just mixed it up.
> 
>>>
>>> The -rt people for example really want to avoid faults (even minor
>>> faults), and DONTMIGRATE would still allow unmapping.
>>>
>>> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
>>> hadn't considered that. The first thing that came to mind is that that
>>> might make the fork() semantics difficult, but maybe it works out.
>>>
>>> And while we're on the subject, my patch preserves PINNED over fork()
>>> but maybe we don't actually need that either.
>>
>> So pinned_vm is userspace exposed, which means we have to maintain the
>> individual counts, and doing the fully orthogonal accounting is 'easier'
>> than trying to get the boundary cases right.
>>
>> That is, if we have a program that does mlockall() and then does the IB
>> ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
>> after it splits the vma, and then do the pinned accounting.
>>
>> Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
>> we don't know what to restore the vma to on mm_munpin().
>>
>> So while the accounting looks tricky, it has simpler semantics.
> 
> What if VM_PINNED required VM_LOCKED?
> I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.

Mlocking makes sense, as pages won't be uselessly scanned on
non-evictable LRU, no? (Or maybe I just don't see that something else
prevents them from being there already).

Anyway I like the idea of playing nicer with compaction etc.




* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 11:11           ` Konstantin Khlebnikov
  2014-05-27 11:50             ` Vlastimil Babka
@ 2014-05-27 13:05             ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Linux Kernel Mailing List, Christoph Lameter,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 03:11:36PM +0400, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
> >> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> >> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> >> > its functionality.
> >> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> >> > This will make accounting easier. No?
> >>
> >> I prefer the PINNED name because not being able to migrate is only
> >> one of the desired effects of it, not the primary effect. We're really
> >> looking to keep physical pages in place and preserve mappings.
> 
> Ah, I just mixed it up.
> 
> >>
> >> The -rt people for example really want to avoid faults (even minor
> >> faults), and DONTMIGRATE would still allow unmapping.
> >>
> >> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
> >> hadn't considered that. The first thing that came to mind is that that
> >> might make the fork() semantics difficult, but maybe it works out.
> >>
> >> And while we're on the subject, my patch preserves PINNED over fork()
> >> but maybe we don't actually need that either.
> >
> > So pinned_vm is userspace exposed, which means we have to maintain the
> > individual counts, and doing the fully orthogonal accounting is 'easier'
> > than trying to get the boundary cases right.
> >
> > That is, if we have a program that does mlockall() and then does the IB
> > ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
> > after it splits the vma, and then do the pinned accounting.
> >
> > Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
> > we don't know what to restore the vma to on mm_munpin().
> >
> > So while the accounting looks tricky, it has simpler semantics.
> 
> What if VM_PINNED will require VM_LOCKED?
> I.e. user must mlock it before pining and cannot munlock vma while it's pinned.

So I don't like restrictions like that if it's at all possible to avoid
-- and in this case, I already wrote the code and it's not _that_
complicated.

But also, that would mean that we'd either have to make mm_mpin() do the
mlock unconditionally (which rather defeats the purpose) or break
userspace assumptions. I'm fairly sure the IB ioctl()s don't require the
memory to be mlocked.



* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 11:50             ` Vlastimil Babka
@ 2014-05-27 13:09               ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Roland Dreier, Sean Hefty, Hal Rosenstock,
	Mike Marciniszyn

On Tue, May 27, 2014 at 01:50:47PM +0200, Vlastimil Babka wrote:
> > What if VM_PINNED required VM_LOCKED?
> > I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.
> 
> Mlocking makes sense, as pages won't be uselessly scanned on
> non-evictable LRU, no? (Or maybe I just don't see that something else
> prevents them from being there already).

We can add VM_PINNED logic to page_check_references() and
try_to_unmap_one() to avoid the scanning if that's a problem. But that's
additional bits.
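
Roughly -- a sketch only, not part of this series -- the check in
try_to_unmap_one() could be as simple as:

	if (vma->vm_flags & VM_PINNED)
		return SWAP_FAIL;	/* keep the pte and the physical page in place */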


* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 10:29       ` Peter Zijlstra
  2014-05-27 10:54         ` Peter Zijlstra
@ 2014-05-27 14:34         ` Christoph Lameter
  2014-05-27 14:46           ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2014-05-27 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, 27 May 2014, Peter Zijlstra wrote:

> The things I care about for VM_PINNED are long term pins, like the IB
> stuff, which sets up its RDMA buffers at the start of a program and
> basically leaves them in place for the entire duration of said program.

Ok that also means the pages are not to be allocated from ZONE_MOVABLE?

I expected the use of a page flag. With a vma flag we may have a situation
where mapping a page into a vma changes it to pinned, and terminating a
process may unpin a page. That means the zone that the page should be
allocated from changes.

Pinned pages in ZONE_MOVABLE are not a good idea. But since "kernelcore"
is rarely used maybe that is not an issue?






* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 14:34         ` Christoph Lameter
@ 2014-05-27 14:46           ` Peter Zijlstra
  2014-05-27 15:14             ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 14:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 09:34:22AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
> 
> > The things I care about for VM_PINNED are long term pins, like the IB
> > stuff, which sets up its RDMA buffers at the start of a program and
> > basically leaves them in place for the entire duration of said program.
> 
> Ok that also means the pages are not to be allocated from ZONE_MOVABLE?

Well, like with IB, they start out as normal userspace pages, and will
be from ZONE_MOVABLE.

> I expected the use of a page flag. With a vma flag we may have a situation
> that mapping a page into a vma changes it to pinned and terminating a
> process may unpin a page. That means the zone that the page should be
> allocated from changes.

So the only way to 'map' something into pinned is what perf does (have
the f_ops->mmap call set VM_PINNED). But that way already ensures we
have full control over the allocation since it's a custom file.

And in fact the perf buffer is allocated with GFP_KERNEL and is thus
already not from MOVABLE.
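
For illustration only, a minimal sketch of such an f_op->mmap hook --
hypothetical names throughout, this is not the actual perf_mmap():

	static int pinned_buf_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* The backing pages were allocated with GFP_KERNEL, so
		 * they can never come from ZONE_MOVABLE; setting VM_PINNED
		 * here only makes the pin visible to the accounting. */
		vma->vm_flags |= VM_PINNED | VM_DONTEXPAND;
		vma->vm_ops = &pinned_buf_vm_ops; /* hypothetical vm_ops */
		return 0;
	}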

Any other use, like (again) the IB stuff, will go through
get_user_pages() which will ensure all the pages are mapped and present.

So I don't think this is a real problem and certainly not one that
requires a page flag.

> Pinned pages in ZONE_MOVABLE are not a good idea. But since "kernelcore"
> is rarely used maybe that is not an issue?

Well, the idea was to migrate pages to a more suitable location on
mm_mpin(). We could choose to move them out again on mm_munpin() or not.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 14:46           ` Peter Zijlstra
@ 2014-05-27 15:14             ` Christoph Lameter
  2014-05-27 15:31               ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2014-05-27 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, 27 May 2014, Peter Zijlstra wrote:

> Well, like with IB, they start out as normal userspace pages, and will
> be from ZONE_MOVABLE.

Well we could change that now, I think. If the VMA has VM_PINNED set,
then do not allocate its pages from ZONE_MOVABLE.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 15:14             ` Christoph Lameter
@ 2014-05-27 15:31               ` Peter Zijlstra
  2014-05-27 16:31                 ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 15:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
> 
> > Well, like with IB, they start out as normal userspace pages, and will
> > be from ZONE_MOVABLE.
> 
> Well we could change that now, I think. If the VMA has VM_PINNED set,
> then do not allocate its pages from ZONE_MOVABLE.

But most allocation sites don't have the vma. We allocate page-cache
pages based on their address_space/mapping, not on whatever vma they're
mapped into.

So I still think the sanest way to do this is by making mm_mpin() do a
mm_populate() and have reclaim skip VM_PINNED pages (so they stay
present), and then migrate the lot out of MOVABLE.
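
In rough, untested pseudo-code -- step 3 is a placeholder for code that
doesn't exist yet:

	int mm_mpin(unsigned long start, size_t len)
	{
		int error;

		/* 1) mark the range VM_PINNED (vma split/merge, as mlock does) */
		/* 2) populate it, so the pages are present; a reclaim-side
		 *    VM_PINNED check then keeps them present */
		error = __mm_populate(start, len, 0);
		/* 3) future work: migrate the lot out of ZONE_MOVABLE/CMA */
		return error;
	}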



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 15:31               ` Peter Zijlstra
@ 2014-05-27 16:31                 ` Christoph Lameter
  2014-05-27 16:43                   ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2014-05-27 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, 27 May 2014, Peter Zijlstra wrote:

> On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> > On Tue, 27 May 2014, Peter Zijlstra wrote:
> >
> > > Well, like with IB, they start out as normal userspace pages, and will
> > > be from ZONE_MOVABLE.
> >
> > Well we could change that now, I think. If the VMA has VM_PINNED set,
> > then do not allocate its pages from ZONE_MOVABLE.
>
> But most allocation sites don't have the vma. We allocate page-cache
> pages based on their address_space/mapping, not on whatever vma they're
> mapped into.

Most allocations by the application for an address range must also
consider a memory allocation policy, which is likewise bound to a vma;
we have code for that in mm/mempolicy.c.

Code could easily be added to alloc_pages_vma() to consider the pinned
status on allocation: remove __GFP_MOVABLE if the vma is pinned.
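
I.e. something along these lines (untested sketch):

	/* in alloc_pages_vma(), before the allocation proper: a pinned
	 * range must never be placed in ZONE_MOVABLE */
	if (vma->vm_flags & VM_PINNED)
		gfp &= ~__GFP_MOVABLE;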


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 16:31                 ` Christoph Lameter
@ 2014-05-27 16:43                   ` Peter Zijlstra
  2014-05-27 16:56                     ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 16:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 11:31:21AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
> 
> > On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> > > On Tue, 27 May 2014, Peter Zijlstra wrote:
> > >
> > > > Well, like with IB, they start out as normal userspace pages, and will
> > > > be from ZONE_MOVABLE.
> > >
> > > Well we could change that now, I think. If the VMA has VM_PINNED set,
> > > then do not allocate its pages from ZONE_MOVABLE.
> >
> > But most allocation sites don't have the vma. We allocate page-cache
> > pages based on their address_space/mapping, not on whatever vma they're
> > mapped into.
> 
> Most allocations by the application for an address range also must
> consider a memory allocation policy which is also bound to a vma and we
> have code for that in mm/mempolicy.c
> 
> Code could easily be added to alloc_pages_vma() to consider the pinned
> status on allocation: remove __GFP_MOVABLE if the vma is pinned.

Yes, but alloc_pages_vma() isn't used for shared pages (with the
exception of shmem and hugetlbfs).

So whichever way around we have to do the mm_populate() + eviction hook
+ migration code, and since that equally covers the anon case, why
bother?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 16:43                   ` Peter Zijlstra
@ 2014-05-27 16:56                     ` Christoph Lameter
  2014-05-27 17:29                       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2014-05-27 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, 27 May 2014, Peter Zijlstra wrote:

> > Code could easily be added to alloc_pages_vma() to consider the pinned
> > status on allocation: remove __GFP_MOVABLE if the vma is pinned.
>
> Yes, but alloc_pages_vma() isn't used for shared pages (with the
> exception of shmem and hugetlbfs).

alloc_pages_vma() is used for all paths where we populate address ranges
with pages. This is what we are doing when pinning. Pages are not
allocated outside of a vma context.

What do you mean by shared pages that are not shmem pages? AnonPages that
are referenced from multiple processes?

> So whichever way around we have to do the mm_populate() + eviction hook
> + migration code, and since that equally covers the anon case, why
> bother?

Migration is expensive and the memory registration overhead already
causes lots of complaints.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 16:56                     ` Christoph Lameter
@ 2014-05-27 17:29                       ` Peter Zijlstra
  2014-05-27 20:00                         ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-27 17:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, May 27, 2014 at 11:56:44AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
> 
> > > Code could easily be added to alloc_pages_vma() to consider the pinned
> > > status on allocation: remove __GFP_MOVABLE if the vma is pinned.
> >
> > Yes, but alloc_pages_vma() isn't used for shared pages (with the
> > exception of shmem and hugetlbfs).
> 
> alloc_pages_vma() is used for all paths where we populate address ranges
> with pages. This is what we are doing when pinning. Pages are not
> allocated outside of a vma context.
> 
> What do you mean by shared pages that are not shmem pages? AnonPages that
> are referenced from multiple processes?

Regular files.. they get allocated through __page_cache_alloc(). AFAIK
there is nothing stopping people from pinning file pages for RDMA or
other purposes. Unusual maybe, but certainly not impossible, and
therefore we must be able to handle it.

> > So whichever way around we have to do the mm_populate() + eviction hook
> > + migration code, and since that equally covers the anon case, why
> > bother?
> 
> Migration is expensive and the memory registration overhead already
> causes lots of complaints.

Sure, but first do the simple thing; then, if it's a problem, do
something else.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 17:29                       ` Peter Zijlstra
@ 2014-05-27 20:00                         ` Christoph Lameter
  2014-05-28  6:14                           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2014-05-27 20:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

On Tue, 27 May 2014, Peter Zijlstra wrote:

> > What do you mean by shared pages that are not shmem pages? AnonPages that
> > are referenced from multiple processes?
>
> Regular files.. they get allocated through __page_cache_alloc(). AFAIK
> there is nothing stopping people from pinning file pages for RDMA or
> other purposes. Unusual maybe, but certainly not impossible, and
> therefore we must be able to handle it.

Typically structures for RDMA are allocated on the heap.

The main use case is pinning the executable pages in the page cache?

Pinning mmapped pagecache pages may not have the desired effect
if those pages are modified and need to be written back to disk, with the
corresponding faults to track the dirty state etc. This may get more
complicated.

> > Migration is expensive and the memory registration overhead already
> > causes lots of complaints.
>
> Sure, but first to the simple thing, then if its a problem do something
> else.

I thought the main issue here was the pinning of IB/RDMA buffers.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-27 20:00                         ` Christoph Lameter
@ 2014-05-28  6:14                           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-28  6:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Thomas Gleixner, Andrew Morton, Hugh Dickins, Mel Gorman,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: Type: text/plain, Size: 1437 bytes --]

On Tue, May 27, 2014 at 03:00:15PM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
> 
> > > What do you mean by shared pages that are not shmem pages? AnonPages that
> > > are referenced from multiple processes?
> >
> > Regular files.. they get allocated through __page_cache_alloc(). AFAIK
> > there is nothing stopping people from pinning file pages for RDMA or
> > other purposes. Unusual maybe, but certainly not impossible, and
> > therefore we must be able to handle it.
> 
> Typically structures for RDMA are allocated on the heap.

Sure, typically. But that's not enough.

> The main use case is pinning the executable pages in the page cache?

No.. although that's one of the things the -rt people are interested in.

> > > Migration is expensive and the memory registration overhead already
> > > causes lots of complaints.
> >
> > Sure, but first do the simple thing; then, if it's a problem, do
> > something else.
> 
> I thought the main issue here was the pinning of IB/RDMA buffers.

It is... but you have to deal with the generic case before you go off
doing specific things.

You're approaching the problem the wrong way; first make it such
that everything works, and only then optimize some specific case, if and
when it becomes important.

Don't start by looking only at the one specific case you're interested
in and forgetting about everything else.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces
  2014-05-26 14:56 ` [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces Peter Zijlstra
@ 2014-05-29  1:48   ` Rik van Riel
  2014-05-29  8:01     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2014-05-29  1:48 UTC (permalink / raw)
  To: Peter Zijlstra, linux-mm, linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Roland Dreier, Sean Hefty, Hal Rosenstock,
	Mike Marciniszyn

On 05/26/2014 10:56 AM, Peter Zijlstra wrote:

>  include/linux/mm.h       |    3 +
>  include/linux/mm_types.h |    5 +
>  kernel/fork.c            |    2 
>  mm/mlock.c               |  133 ++++++++++++++++++++++++++++++++++++++++++-----
>  mm/mmap.c                |   18 ++++--
>  5 files changed, 141 insertions(+), 20 deletions(-)

I'm guessing you will also want a patch that adds some code to
rmap.c, madvise.c, and a few other places to actually enforce
the VM_PINNED semantics?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces
  2014-05-29  1:48   ` Rik van Riel
@ 2014-05-29  8:01     ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2014-05-29  8:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Christoph Lameter, Thomas Gleixner,
	Andrew Morton, Hugh Dickins, Mel Gorman, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Mike Marciniszyn

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On Wed, May 28, 2014 at 09:48:43PM -0400, Rik van Riel wrote:
> On 05/26/2014 10:56 AM, Peter Zijlstra wrote:
> 
> >  include/linux/mm.h       |    3 +
> >  include/linux/mm_types.h |    5 +
> >  kernel/fork.c            |    2 
> >  mm/mlock.c               |  133 ++++++++++++++++++++++++++++++++++++++++++-----
> >  mm/mmap.c                |   18 ++++--
> >  5 files changed, 141 insertions(+), 20 deletions(-)
> 
> I'm guessing you will also want a patch that adds some code to
> rmap.c, madvise.c, and a few other places to actually enforce
> the VM_PINNED semantics?

Eventually, yes. As it stands they're not needed, because perf pages
aren't in the pagecache, and IB does a big get_user_pages() whose
elevated refcount stops everything dead.
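
For reference, a simplified sketch of that long-term-pin pattern -- not
the verbatim IB code:

	down_read(&current->mm->mmap_sem);
	/* Faults the whole range in and takes a reference on every page;
	 * that elevated refcount is what makes migration and reclaim of
	 * these pages fail until the eventual put_page(). */
	npages = get_user_pages(current, current->mm, start, nr_pages,
				1 /* write */, 0 /* force */, pages, NULL);
	up_read(&current->mm->mmap_sem);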

But yes, once we go do fancy things like migrating the pages into
UNMOVABLE blocks, then we need to have VM_PINNED itself mean something,
instead of it merely describing something.

But before we go down there, I'd like to get the IB bits sorted and
general agreement on the path.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 0/5] VM_PINNED
  2014-05-26 20:32   ` Peter Zijlstra
  2014-05-26 20:49     ` Konstantin Khlebnikov
@ 2014-08-01 10:16     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 28+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-01 10:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Christoph Lameter, Thomas Gleixner, Andrew Morton, Hugh Dickins,
	Mel Gorman, Roland Dreier, Sean Hefty, Hal Rosenstock,
	Mike Marciniszyn, Alex Williamson, Alexey Kardashevskiy

On Mon, 2014-05-26 at 22:32 +0200, Peter Zijlstra wrote:

> Not sure what you mean, the one bit is perfectly fine for what I want it
> to do.
> 
> > This supposed to supports pinning only by one user and only in its own mm?
> 
> Pretty much, that's adequate for all users I'm aware of and mirrors the
> mlock semantics.

Ok so I only just saw this. CC'ing Alex Williamson

There is definitely another potential user for this stuff, which is KVM
with passed-through devices.

What vfio does today on x86 is "interesting":

Look at drivers/vfio/vfio_iommu_type1.c and the function vfio_pin_pages().

I especially like the racy "delayed" accounting ...
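
From memory, the pattern is roughly the below (a hedged paraphrase, see
the file for the real thing): when the caller cannot sleep, the
locked_vm update is deferred to a workqueue, so the counter lags the
pins it is supposed to limit.

	struct vwork {
		struct mm_struct	*mm;
		long			npage;
		struct work_struct	work;
	};

	/* The pin has already happened by the time this runs, so nothing
	 * orders the RLIMIT_MEMLOCK check against the counter update. */
	static void vfio_lock_acct_bg(struct work_struct *work)
	{
		struct vwork *vwork = container_of(work, struct vwork, work);

		down_write(&vwork->mm->mmap_sem);
		vwork->mm->locked_vm += vwork->npage;
		up_write(&vwork->mm->mmap_sem);
	}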

The problem is that in the generic case of VFIO, we don't know in
advance what needs to be pinned. The user might pin pages on demand and
it has to be a reasonably fast path.

Additionally, a given page can be mapped multiple times and we don't
have a good place to keep a counter....

So the one bit of state is definitely not enough.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-08-01 10:23 UTC | newest]

Thread overview: 28+ messages
2014-05-26 14:56 [RFC][PATCH 0/5] VM_PINNED Peter Zijlstra
2014-05-26 14:56 ` [RFC][PATCH 1/5] mm: Introduce VM_PINNED and interfaces Peter Zijlstra
2014-05-29  1:48   ` Rik van Riel
2014-05-29  8:01     ` Peter Zijlstra
2014-05-26 14:56 ` [RFC][PATCH 2/5] mm,perf: Make use of VM_PINNED Peter Zijlstra
2014-05-26 14:56 ` [RFC][PATCH 3/5] mm,ib,umem: Use VM_PINNED Peter Zijlstra
2014-05-26 14:56 ` [RFC][PATCH 4/5] mm,ib,ipath: " Peter Zijlstra
2014-05-26 14:56 ` [RFC][PATCH 5/5] mm,ib,qib: " Peter Zijlstra
2014-05-26 20:19 ` [RFC][PATCH 0/5] VM_PINNED Konstantin Khlebnikov
2014-05-26 20:32   ` Peter Zijlstra
2014-05-26 20:49     ` Konstantin Khlebnikov
2014-05-27 10:29       ` Peter Zijlstra
2014-05-27 10:54         ` Peter Zijlstra
2014-05-27 11:11           ` Konstantin Khlebnikov
2014-05-27 11:50             ` Vlastimil Babka
2014-05-27 13:09               ` Peter Zijlstra
2014-05-27 13:05             ` Peter Zijlstra
2014-05-27 14:34         ` Christoph Lameter
2014-05-27 14:46           ` Peter Zijlstra
2014-05-27 15:14             ` Christoph Lameter
2014-05-27 15:31               ` Peter Zijlstra
2014-05-27 16:31                 ` Christoph Lameter
2014-05-27 16:43                   ` Peter Zijlstra
2014-05-27 16:56                     ` Christoph Lameter
2014-05-27 17:29                       ` Peter Zijlstra
2014-05-27 20:00                         ` Christoph Lameter
2014-05-28  6:14                           ` Peter Zijlstra
2014-08-01 10:16     ` Benjamin Herrenschmidt
