linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/3] Anonymous VMA naming patches
@ 2020-09-01 16:14 Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse Sumit Semwal
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Sumit Semwal @ 2020-09-01 16:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross,
	Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Sumit Semwal

Version v4 of these patches was sent by Colin Cross a long time ago [1]
and [2]. At the time, these patches were not merged, and it looks like they
just fell off the radar since.

In our efforts to run Android on mainline kernels, we realised that since past
some time, this patchset is needed for Android to boot, hence I am re-posting
it to try and get these discussed and hopefully merged.

I have rebased these for v5.9-rc3 and fixed minor updates as required.

---
v6: Rebased to v5.9-rc3 and addressed review comments:
    - added missing callers in fs/userfaultd.c
    - simplified the union
    - use the new access_remote_vm_locked() in show_map_vma() since that
       already holds mmap_lock
v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't defined

Colin Cross (2):
  mm: rearrange madvise code to allow for reuse
  mm: add a field to store names for private anonymous memory

Sumit Semwal (1):
  mm: memory: Add access_remote_vm_locked variant

 Documentation/filesystems/proc.rst |   2 +
 fs/proc/task_mmu.c                 |  24 +-
 fs/userfaultfd.c                   |   7 +-
 include/linux/mm.h                 |  14 +-
 include/linux/mm_types.h           |  25 +-
 include/uapi/linux/prctl.h         |   3 +
 kernel/sys.c                       |  32 +++
 mm/interval_tree.c                 |   2 +-
 mm/madvise.c                       | 356 +++++++++++++++++------------
 mm/memory.c                        |  49 +++-
 mm/mempolicy.c                     |   3 +-
 mm/mlock.c                         |   2 +-
 mm/mmap.c                          |  38 +--
 mm/mprotect.c                      |   2 +-
 14 files changed, 381 insertions(+), 178 deletions(-)

-- 
2.28.0



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse
  2020-09-01 16:14 [PATCH v7 0/3] Anonymous VMA naming patches Sumit Semwal
@ 2020-09-01 16:14 ` Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
  2 siblings, 0 replies; 15+ messages in thread
From: Sumit Semwal @ 2020-09-01 16:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross,
	Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim,
	Sumit Semwal

From: Colin Cross <ccross@google.com>

Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.

Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter.  The only caller for now is
sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
the next patch will add an additional caller.

Move handling all vma behaviors inside madvise_behavior, and rename it to
madvise_vma_behavior.

Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma.  The next patch will add support for updating a new
anon_name field as well.

Signed-off-by: Colin Cross <ccross@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Rob Landley <rob@landley.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  [sumits: rebased over v5.9-rc3]
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
---
 mm/madvise.c | 312 +++++++++++++++++++++++++++------------------------
 1 file changed, 168 insertions(+), 144 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..84482c21b029 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,76 +60,20 @@ static int madvise_need_mmap_write(int behavior)
 }
 
 /*
- * We can potentially split a vm area into separate
- * areas, each area with its own behavior.
+ * Update the vm_flags on regiion of a vma, splitting it or merging it as
+ * necessary.  Must be called with mmap_sem held for writing;
  */
-static long madvise_behavior(struct vm_area_struct *vma,
-		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end, int behavior)
+static int madvise_update_vma(struct vm_area_struct *vma,
+			      struct vm_area_struct **prev, unsigned long start,
+			      unsigned long end, unsigned long new_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int error = 0;
+	int error;
 	pgoff_t pgoff;
-	unsigned long new_flags = vma->vm_flags;
-
-	switch (behavior) {
-	case MADV_NORMAL:
-		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
-		break;
-	case MADV_SEQUENTIAL:
-		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
-		break;
-	case MADV_RANDOM:
-		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
-		break;
-	case MADV_DONTFORK:
-		new_flags |= VM_DONTCOPY;
-		break;
-	case MADV_DOFORK:
-		if (vma->vm_flags & VM_IO) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTCOPY;
-		break;
-	case MADV_WIPEONFORK:
-		/* MADV_WIPEONFORK is only supported on anonymous memory. */
-		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags |= VM_WIPEONFORK;
-		break;
-	case MADV_KEEPONFORK:
-		new_flags &= ~VM_WIPEONFORK;
-		break;
-	case MADV_DONTDUMP:
-		new_flags |= VM_DONTDUMP;
-		break;
-	case MADV_DODUMP:
-		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTDUMP;
-		break;
-	case MADV_MERGEABLE:
-	case MADV_UNMERGEABLE:
-		error = ksm_madvise(vma, start, end, behavior, &new_flags);
-		if (error)
-			goto out_convert_errno;
-		break;
-	case MADV_HUGEPAGE:
-	case MADV_NOHUGEPAGE:
-		error = hugepage_madvise(vma, &new_flags, behavior);
-		if (error)
-			goto out_convert_errno;
-		break;
-	}
 
 	if (new_flags == vma->vm_flags) {
 		*prev = vma;
-		goto out;
+		return 0;
 	}
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -146,21 +90,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	if (start != vma->vm_start) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, start, 1);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 	if (end != vma->vm_end) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, end, 0);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 success:
@@ -169,15 +113,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	 */
 	vma->vm_flags = new_flags;
 
-out_convert_errno:
-	/*
-	 * madvise() returns EAGAIN if kernel resources, such as
-	 * slab, are temporarily unavailable.
-	 */
-	if (error == -ENOMEM)
-		error = -EAGAIN;
-out:
-	return error;
+	return 0;
 }
 
 #ifdef CONFIG_SWAP
@@ -862,6 +798,93 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+/*
+ * Apply an madvise behavior to a region of a vma.  madvise_update_vma
+ * will handle splitting a vm area into separate areas, each area with its own
+ * behavior.
+ */
+static int madvise_vma_behavior(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end,
+				unsigned long behavior)
+{
+	int error = 0;
+	unsigned long new_flags = vma->vm_flags;
+
+	switch (behavior) {
+	case MADV_REMOVE:
+		return madvise_remove(vma, prev, start, end);
+	case MADV_WILLNEED:
+		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
+	case MADV_FREE:
+	case MADV_DONTNEED:
+		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_NORMAL:
+		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+		break;
+	case MADV_SEQUENTIAL:
+		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
+		break;
+	case MADV_RANDOM:
+		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+		break;
+	case MADV_DONTFORK:
+		new_flags |= VM_DONTCOPY;
+		break;
+	case MADV_DOFORK:
+		if (vma->vm_flags & VM_IO) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_DONTCOPY;
+		break;
+	case MADV_WIPEONFORK:
+		/* MADV_WIPEONFORK is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
+	case MADV_DONTDUMP:
+		new_flags |= VM_DONTDUMP;
+		break;
+	case MADV_DODUMP:
+		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_DONTDUMP;
+		break;
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+		error = ksm_madvise(vma, start, end, behavior, &new_flags);
+		if (error)
+			goto out;
+		break;
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		error = hugepage_madvise(vma, &new_flags, behavior);
+		if (error)
+			goto out;
+		break;
+	}
+
+	error = madvise_update_vma(vma, prev, start, end, new_flags);
+
+out:
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -931,27 +954,6 @@ static int madvise_inject_error(int behavior,
 }
 #endif
 
-static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
-{
-	switch (behavior) {
-	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
-	case MADV_WILLNEED:
-		return madvise_willneed(vma, prev, start, end);
-	case MADV_COLD:
-		return madvise_cold(vma, prev, start, end);
-	case MADV_PAGEOUT:
-		return madvise_pageout(vma, prev, start, end);
-	case MADV_FREE:
-	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
-	default:
-		return madvise_behavior(vma, prev, start, end, behavior);
-	}
-}
-
 static bool
 madvise_behavior_valid(int behavior)
 {
@@ -990,6 +992,73 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
+/*
+ * Walk the vmas in range [start,end), and call the visit function on each one.
+ * The visit function will get start and end parameters that cover the overlap
+ * between the current vma and the original range.  Any unmapped regions in the
+ * original range will result in this function returning -ENOMEM while still
+ * calling the visit function on all of the existing vmas in the range.
+ * Must be called with the mmap_lock held for reading or writing.
+ */
+static
+int madvise_walk_vmas(unsigned long start, unsigned long end,
+		      unsigned long arg,
+		      int (*visit)(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev, unsigned long start,
+				   unsigned long end, unsigned long arg))
+{
+	struct vm_area_struct *vma;
+	struct vm_area_struct *prev;
+	unsigned long tmp;
+	int unmapped_error = 0;
+
+	/*
+	 * If the interval [start,end) covers some unmapped address
+	 * ranges, just ignore them, but return -ENOMEM at the end.
+	 * - different from the way of handling in mlock etc.
+	 */
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		int error;
+
+		/* Still start < end. */
+		if (!vma)
+			return -ENOMEM;
+
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			unmapped_error = -ENOMEM;
+			start = vma->vm_start;
+			if (start >= end)
+				break;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = visit(vma, &prev, start, tmp, arg);
+		if (error)
+			return error;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_lock */
+			vma = find_vma(current->mm, start);
+	}
+
+	return unmapped_error;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1053,9 +1122,7 @@ madvise_behavior_valid(int behavior)
  */
 int do_madvise(unsigned long start, size_t len_in, int behavior)
 {
-	unsigned long end, tmp;
-	struct vm_area_struct *vma, *prev;
-	int unmapped_error = 0;
+	unsigned long end;
 	int error = -EINVAL;
 	int write;
 	size_t len;
@@ -1112,51 +1179,8 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 		mmap_read_lock(current->mm);
 	}
 
-	/*
-	 * If the interval [start,end) covers some unmapped address
-	 * ranges, just ignore them, but return -ENOMEM at the end.
-	 * - different from the way of handling in mlock etc.
-	 */
-	vma = find_vma_prev(current->mm, start, &prev);
-	if (vma && start > vma->vm_start)
-		prev = vma;
-
 	blk_start_plug(&plug);
-	for (;;) {
-		/* Still start < end. */
-		error = -ENOMEM;
-		if (!vma)
-			goto out;
-
-		/* Here start < (end|vma->vm_end). */
-		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
-			start = vma->vm_start;
-			if (start >= end)
-				goto out;
-		}
-
-		/* Here vma->vm_start <= start < (end|vma->vm_end) */
-		tmp = vma->vm_end;
-		if (end < tmp)
-			tmp = end;
-
-		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
-		if (error)
-			goto out;
-		start = tmp;
-		if (prev && start < prev->vm_end)
-			start = prev->vm_end;
-		error = unmapped_error;
-		if (start >= end)
-			goto out;
-		if (prev)
-			vma = prev->vm_next;
-		else	/* madvise_remove dropped mmap_lock */
-			vma = find_vma(current->mm, start);
-	}
-out:
+	error = madvise_walk_vmas(start, end, behavior, madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	if (write)
 		mmap_write_unlock(current->mm);
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant
  2020-09-01 16:14 [PATCH v7 0/3] Anonymous VMA naming patches Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse Sumit Semwal
@ 2020-09-01 16:14 ` Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
  2 siblings, 0 replies; 15+ messages in thread
From: Sumit Semwal @ 2020-09-01 16:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross,
	Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Sumit Semwal

This allows accessing a remote vm while the mmap_lock is already
held by the caller.

While adding support for anonymous vma naming, show_map_vma()
needs to access the remote vm to get the name of the anonymous vma.
Since show_map_vma() already holds the mmap_lock, so this _locked
variant was required.

Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 49 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..e9212c0bb5ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1708,6 +1708,8 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, unsigned int gup_flags);
+extern int access_remote_vm_locked(struct mm_struct *mm, unsigned long addr,
+				   void *buf, int len, unsigned int gup_flags);
 
 long get_user_pages_remote(struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..207be99390e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4726,17 +4726,17 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
 /*
  * Access another process' address space as given in mm.  If non-NULL, use the
  * given task for page fault accounting.
+ * This variant assumes that the mmap_lock is already held by the caller, so
+ * doesn't take the mmap_lock.
  */
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+int __access_remote_vm_locked(struct task_struct *tsk, struct mm_struct *mm,
+			      unsigned long addr, void *buf, int len,
+			      unsigned int gup_flags)
 {
 	struct vm_area_struct *vma;
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
-	if (mmap_read_lock_killable(mm))
-		return 0;
-
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
 		int bytes, ret, offset;
@@ -4785,9 +4785,46 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		buf += bytes;
 		addr += bytes;
 	}
+	return buf - old_buf;
+}
+
+/*
+ * Access another process' address space as given in mm.  If non-NULL, use the
+ * given task for page fault accounting.
+ */
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+		       unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	if (mmap_read_lock_killable(mm))
+		return 0;
+
+	ret = __access_remote_vm_locked(tsk, mm, addr, buf, len, gup_flags);
 	mmap_read_unlock(mm);
 
-	return buf - old_buf;
+	return ret;
+}
+
+/**
+ * access_remote_vm_locked - access another process' address space, without
+ * taking the mmap_lock. This allows nested calls from callers that already have
+ * taken the lock.
+ *
+ * @mm:		the mm_struct of the target address space
+ * @addr:	start address to access
+ * @buf:	source or destination buffer
+ * @len:	number of bytes to transfer
+ * @gup_flags:	flags modifying lookup behaviour
+ *
+ * The caller must hold a reference on @mm, as well as hold the mmap_lock
+ *
+ * Return: number of bytes copied from source to destination.
+ */
+int access_remote_vm_locked(struct mm_struct *mm, unsigned long addr, void *buf,
+			    int len, unsigned int gup_flags)
+{
+	return __access_remote_vm_locked(NULL, mm, addr, buf, len, gup_flags);
 }
 
 /**
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-01 16:14 [PATCH v7 0/3] Anonymous VMA naming patches Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse Sumit Semwal
  2020-09-01 16:14 ` [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant Sumit Semwal
@ 2020-09-01 16:14 ` Sumit Semwal
  2020-09-03 13:25   ` Kirill A. Shutemov
  2020-09-03 18:00   ` Kees Cook
  2 siblings, 2 replies; 15+ messages in thread
From: Sumit Semwal @ 2020-09-01 16:14 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross,
	Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim,
	Sumit Semwal

From: Colin Cross <ccross@google.com>

In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in use.
 At a minimum there is libc malloc and the stack, and in many cases there
are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
multiple VM heaps (one for small objects, one for big objects, etc.).
Each of these layers usually has its own tools to inspect its usage;
malloc by compiling a debug version, the VM through heap inspection tools,
and for direct syscalls there is usually no way to track them.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
in userspace and slice their usage by process, shared (COW) vs.  unique
mappings, backing, etc.  This can account for real physical memory usage
even in cases like fork without exec (which Android uses heavily to share
as many private COW pages as possible between processes), Kernel SamePage
Merging, and clean zero pages.  It produces a measurement of the pages
that only exist in that process (USS, for unique), and a measurement of
the physical memory usage of that process with the cost of shared pages
being evenly split between processes that share them (PSS).

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap walking
tool that can understand the heap debugging of every layer, or for every
layer's heap debugging tools to implement the pagemap walking logic, in
which case it is hard to get a consistent view of memory across the whole
system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that
somebody needs to clean up on crashes.  It needs to be readable while
the process is still running, so it has to have some sort of
synchronization with every layer of userspace.  Efficiently tracking
the ranges requires reimplementing something like the kernel vma
trees, and linking to it from every layer of userspace.  It requires
more memory, more syscalls, more runtime cost, and more complexity to
separately track regions that the kernel is already tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas.  The names of named anonymous
vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.

The name is stored in a user pointer in the shared union in vm_area_struct
that points to a null terminated string inside the user process.  vmas
that point to the same address and are otherwise mergeable will be merged,
but vmas that point to equivalent strings at different addresses will not
be merged.

The idea to store a userspace pointer to reduce the complexity within mm
(at the expense of the complexity of reading /proc/pid/mem) came from Dave
Hansen.  This results in no runtime overhead in the mm subsystem other
than comparing the anon_name pointers when considering vma merging.  The
pointer is stored in a union with fields that are only used on file-backed
mappings, so it does not increase memory usage.
(Upstream changed to remove the union, so this patch adds it back as well)

Signed-off-by: Colin Cross <ccross@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Rob Landley <rob@landley.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>

---

v2: updates the commit message to explain in more detail why the
    patch is useful.
v3: renames vma_get_anon_name to vma_anon_name
    replaces logic in seq_print_vma_name with access_process_vm
    removes Name: entry from smaps, it's already on the header line
    changes the prctl option number to match what is currently in
       use on Android
v4: adds paragraph to commit log on why this is better than tracking
       in userspace
    squashes fixes from Andrew Morton to fix build error and warning
    fix build error reported by Mark Salter when !CONFIG_MMU
v5: rebased to v5.9-rc1, added minor fixes to match upstream
v6: rebased to v5.9-rc3, and addressed review comments:
       - added missing callers in fs/userfaultd.c
       - simplified the union
       - use the new access_remote_vm_locked() in show_map_vma()
         since that already holds mmap_lock
v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't
      defined
---
 Documentation/filesystems/proc.rst |  2 ++
 fs/proc/task_mmu.c                 | 24 ++++++++++++-
 fs/userfaultfd.c                   |  7 ++--
 include/linux/mm.h                 | 12 ++++++-
 include/linux/mm_types.h           | 25 +++++++++++---
 include/uapi/linux/prctl.h         |  3 ++
 kernel/sys.c                       | 32 ++++++++++++++++++
 mm/interval_tree.c                 |  2 +-
 mm/madvise.c                       | 54 +++++++++++++++++++++++++++---
 mm/mempolicy.c                     |  3 +-
 mm/mlock.c                         |  2 +-
 mm/mmap.c                          | 38 ++++++++++++---------
 mm/mprotect.c                      |  2 +-
 13 files changed, 173 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 533c79e8d2cd..41a9cea73b8b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -429,6 +429,8 @@ is not associated with a file:
  [stack]                    the stack of the main process
  [vdso]                     the "virtual dynamic shared object",
                             the kernel system call handler
+[anon:<name>]               an anonymous mapping that has been
+                            named by userspace
  =======                    ====================================
 
  or if empty, the mapping is anonymous.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5066b0251ed8..290ce5ecad0d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -97,6 +97,21 @@ unsigned long task_statm(struct mm_struct *mm,
 	return mm->total_vm;
 }
 
+static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	char anon_name[NAME_MAX + 1];
+	int n;
+
+	n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma), anon_name,
+				    NAME_MAX, 0);
+	if (n > 0) {
+		seq_puts(m, "[anon:");
+		seq_write(m, anon_name, strnlen(anon_name, n));
+		seq_putc(m, ']');
+	}
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Save get_task_policy() for show_numa_map().
@@ -319,8 +334,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 			goto done;
 		}
 
-		if (is_stack(vma))
+		if (is_stack(vma)) {
 			name = "[stack]";
+			goto done;
+		}
+
+		if (vma_anon_name(vma)) {
+			seq_pad(m, ' ');
+			seq_print_vma_name(m, vma);
+		}
 	}
 
 done:
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0e4a3837da52..0741fc9c57c6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -874,7 +874,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 					 new_flags, vma->anon_vma,
 					 vma->vm_file, vma->vm_pgoff,
 					 vma_policy(vma),
-					 NULL_VM_UFFD_CTX);
+					 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 			if (prev)
 				vma = prev;
 			else
@@ -1425,7 +1425,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 ((struct vm_userfaultfd_ctx){ ctx }));
+				 ((struct vm_userfaultfd_ctx){ ctx }),
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
@@ -1597,7 +1598,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX);
+				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9212c0bb5ac..430bfd449015 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,7 +2491,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *, struct vm_userfaultfd_ctx);
+	struct mempolicy *, struct vm_userfaultfd_ctx, const char __user *);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
@@ -3130,5 +3130,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifdef CONFIG_ADVISE_SYSCALLS
+int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+			  unsigned long name_addr);
+#else
+static inline int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+					unsigned long name_addr) {
+	return 0;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..f7d54ae487e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -336,11 +336,19 @@ struct vm_area_struct {
 	/*
 	 * For areas with an address space and backing store,
 	 * linkage into the address_space->i_mmap interval tree.
+	 *
+	 * For private anonymous mappings, a pointer to a null terminated string
+	 * in the user process containing the name given to the vma, or NULL
+	 * if unnamed.
 	 */
-	struct {
-		struct rb_node rb;
-		unsigned long rb_subtree_last;
-	} shared;
+
+	union {
+		struct {
+			struct rb_node rb;
+			unsigned long rb_subtree_last;
+		} shared;
+		const char __user *anon_name;
+	};
 
 	/*
 	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
@@ -772,4 +780,13 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/* Return the name for an anonymous mapping or NULL for a file-backed mapping */
+static inline const char __user *vma_anon_name(struct vm_area_struct *vma)
+{
+	if (vma->vm_file)
+		return NULL;
+
+	return vma->anon_name;
+}
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..10773270f67b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+#define PR_SET_VMA		0x53564d41
+# define PR_SET_VMA_ANON_NAME		0
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index ab6c409b1159..1af718c1d812 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2280,6 +2280,35 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
+#ifdef CONFIG_MMU
+static int prctl_set_vma(unsigned long opt, unsigned long addr,
+			 unsigned long len, unsigned long arg)
+{
+	struct mm_struct *mm = current->mm;
+	int error;
+
+	mmap_write_lock(mm);
+
+	switch (opt) {
+	case PR_SET_VMA_ANON_NAME:
+		error = madvise_set_anon_name(addr, len, arg);
+		break;
+	default:
+		error = -EINVAL;
+	}
+
+	mmap_write_unlock(mm);
+
+	return error;
+}
+#else /* CONFIG_MMU */
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+			 unsigned long len_in, unsigned long arg)
+{
+	return -EINVAL;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2530,6 +2559,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_VMA:
+		error = prctl_set_vma(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/interval_tree.c b/mm/interval_tree.c
index 11c75fb07584..dd522625d6f6 100644
--- a/mm/interval_tree.c
+++ b/mm/interval_tree.c
@@ -45,7 +45,7 @@ void vma_interval_tree_insert_after(struct vm_area_struct *node,
 			parent->shared.rb_subtree_last = last;
 		while (parent->shared.rb.rb_left) {
 			parent = rb_entry(parent->shared.rb.rb_left,
-				struct vm_area_struct, shared.rb);
+					  struct vm_area_struct, shared.rb);
 			if (parent->shared.rb_subtree_last < last)
 				parent->shared.rb_subtree_last = last;
 		}
diff --git a/mm/madvise.c b/mm/madvise.c
index 84482c21b029..30c7326e7863 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -65,13 +65,14 @@ static int madvise_need_mmap_write(int behavior)
  */
 static int madvise_update_vma(struct vm_area_struct *vma,
 			      struct vm_area_struct **prev, unsigned long start,
-			      unsigned long end, unsigned long new_flags)
+			      unsigned long end, unsigned long new_flags,
+			      const char __user *new_anon_name)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int error;
 	pgoff_t pgoff;
 
-	if (new_flags == vma->vm_flags) {
+	if (new_flags == vma->vm_flags && new_anon_name == vma_anon_name(vma)) {
 		*prev = vma;
 		return 0;
 	}
@@ -79,7 +80,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, new_anon_name);
 	if (*prev) {
 		vma = *prev;
 		goto success;
@@ -112,10 +113,30 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 */
 	vma->vm_flags = new_flags;
+	if (!vma->vm_file)
+		vma->anon_name = new_anon_name;
 
 	return 0;
 }
 
+static int madvise_vma_anon_name(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end,
+				 unsigned long name_addr)
+{
+	int error;
+
+	/* Only anonymous mappings can be named */
+	if (vma->vm_file)
+		return -EINVAL;
+
+	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
+				   (const char __user *)name_addr);
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_SWAP
 static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	unsigned long end, struct mm_walk *walk)
@@ -877,7 +898,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	}
 
-	error = madvise_update_vma(vma, prev, start, end, new_flags);
+	error = madvise_update_vma(vma, prev, start, end, new_flags,
+				   vma_anon_name(vma));
 
 out:
 	if (error == -ENOMEM)
@@ -1059,6 +1081,30 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
 	return unmapped_error;
 }
 
+int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+			  unsigned long name_addr)
+{
+	unsigned long end;
+	unsigned long len;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+	/* Check to see whether len was rounded up from small -ve to zero */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	return madvise_walk_vmas(start, end, name_addr, madvise_vma_anon_name);
+}
+
 /*
  * The madvise(2) system call.
  *
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..94338d9bfe57 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -829,7 +829,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				 vma->anon_vma, vma->vm_file, pgoff,
-				 new_pol, vma->vm_userfaultfd_ctx);
+				 new_pol, vma->vm_userfaultfd_ctx,
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..8e0046c4642f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -534,7 +534,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..8f3cd352a48f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -987,7 +987,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 				struct file *file, unsigned long vm_flags,
-				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+				const char __user *anon_name)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -1005,6 +1006,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
 		return 0;
+	if (vma_anon_name(vma) != anon_name)
+		return 0;
 	return 1;
 }
 
@@ -1037,9 +1040,10 @@ static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
 		     struct anon_vma *anon_vma, struct file *file,
 		     pgoff_t vm_pgoff,
-		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		     const char __user *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -1058,9 +1062,10 @@ static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 		    struct anon_vma *anon_vma, struct file *file,
 		    pgoff_t vm_pgoff,
-		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		     const char __user *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1071,9 +1076,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 }
 
 /*
- * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.
- * Or both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
+ * figure out whether that can be merged with its predecessor or its
+ * successor.  Or both (it neatly fills a hole).
  *
  * In most cases - when called for mmap, brk or mremap - [addr,end) is
  * certain not to be mapped by the time vma_merge is called; but when
@@ -1118,7 +1123,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
 			pgoff_t pgoff, struct mempolicy *policy,
-			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+			const char __user *anon_name)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1151,7 +1157,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
 					    anon_vma, file, pgoff,
-					    vm_userfaultfd_ctx)) {
+					    vm_userfaultfd_ctx, anon_name)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
@@ -1160,7 +1166,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 				can_vma_merge_before(next, vm_flags,
 						     anon_vma, file,
 						     pgoff+pglen,
-						     vm_userfaultfd_ctx) &&
+						     vm_userfaultfd_ctx, anon_name) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1183,7 +1189,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
 					     anon_vma, file, pgoff+pglen,
-					     vm_userfaultfd_ctx)) {
+					     vm_userfaultfd_ctx, anon_name)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = __vma_adjust(prev, prev->vm_start,
 					 addr, prev->vm_pgoff, NULL, next);
@@ -1731,7 +1737,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -1779,7 +1785,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 */
 		if (unlikely(vm_flags != vma->vm_flags && prev)) {
 			merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
+				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 			if (merge) {
 				fput(file);
 				vm_area_free(vma);
@@ -3063,7 +3069,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -3262,7 +3268,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+			    vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..d90c349a3fd9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -454,7 +454,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
 			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			   vma->vm_userfaultfd_ctx);
+			   vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*pprev) {
 		vma = *pprev;
 		VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
@ 2020-09-03 13:25   ` Kirill A. Shutemov
  2020-09-03 13:43     ` Matthew Wilcox
  2020-09-03 18:00   ` Kees Cook
  1 sibling, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2020-09-03 13:25 UTC (permalink / raw)
  To: Sumit Semwal
  Cc: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Kees Cook, Michal Hocko,
	Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Tue, Sep 01, 2020 at 09:44:59PM +0530, Sumit Semwal wrote:
> From: Colin Cross <ccross@google.com>
> 
> In many userspace applications, and especially in VM based applications
> like Android uses heavily, there are multiple different allocators in use.
>  At a minimum there is libc malloc and the stack, and in many cases there
> are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> multiple VM heaps (one for small objects, one for big objects, etc.).
> Each of these layers usually has its own tools to inspect its usage;
> malloc by compiling a debug version, the VM through heap inspection tools,
> and for direct syscalls there is usually no way to track them.
> 
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
> 
> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> userspace-provided name for anonymous vmas.  The names of named anonymous
> vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].

Hm. I guess that there might be tools that expect the field to be empty
for anonymous memory, no?
 
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> Setting the name to NULL clears it.
> 
> The name is stored in a user pointer in the shared union in vm_area_struct
> that points to a null terminated string inside the user process.  vmas
> that point to the same address and are otherwise mergeable will be merged,
> but vmas that point to equivalent strings at different addresses will not
> be merged.
> 
> The idea to store a userspace pointer to reduce the complexity within mm
> (at the expense of the complexity of reading /proc/pid/mem) came from Dave
> Hansen.  This results in no runtime overhead in the mm subsystem other
> than comparing the anon_name pointers when considering vma merging.  The
> pointer is stored in a union with fields that are only used on file-backed
> mappings, so it does not increase memory usage.
> (Upstream changed to remove the union, so this patch adds it back as well)

IIUC, it gives userspace direct control of content of /proc/$PID/maps and
/proc/$PID/smaps. There's no verification of the given string whatsoever.
I'm sure security experts would find clever usage of the feature :P

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 13:25   ` Kirill A. Shutemov
@ 2020-09-03 13:43     ` Matthew Wilcox
  2020-09-03 13:58       ` Kirill A. Shutemov
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2020-09-03 13:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sumit Semwal, Andrew Morton, linux-mm, linux-kernel,
	Alexey Dobriyan, Jonathan Corbet, Mauro Carvalho Chehab,
	Kees Cook, Michal Hocko, Colin Cross, Alexey Gladkov,
	Jason Gunthorpe, Kirill A . Shutemov, Michel Lespinasse,
	Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 03, 2020 at 04:25:37PM +0300, Kirill A. Shutemov wrote:
> IIUC, it gives userspace direct control of content of /proc/$PID/maps and
> /proc/$PID/smaps. There's no verification of the given string whatsoever.
> I'm sure security experts would find clever usage of the feature :P

What, you think that naming a VMA
"\n55bc3e0f9000-55bc3e0fb000 r--p 00000000 fd:01 16777285                   /bin/cat" might cause problems?

Would it be enough to restrict the characters to isalnum()?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 13:43     ` Matthew Wilcox
@ 2020-09-03 13:58       ` Kirill A. Shutemov
  2020-09-03 15:59         ` Colin Cross
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2020-09-03 13:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Sumit Semwal, Andrew Morton, linux-mm,
	linux-kernel, Alexey Dobriyan, Jonathan Corbet,
	Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross,
	Alexey Gladkov, Jason Gunthorpe, Michel Lespinasse,
	Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 03, 2020 at 02:43:40PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 04:25:37PM +0300, Kirill A. Shutemov wrote:
> > IIUC, it gives userspace direct control of content of /proc/$PID/maps and
> > /proc/$PID/smaps. There's no verification of the given string whatsoever.
> > I'm sure security experts would find clever usage of the feature :P
> 
> What, you think that naming a VMA
> "\n55bc3e0f9000-55bc3e0fb000 r--p 00000000 fd:01 16777285                   /bin/cat" might cause problems?

Something that would cause buffer overrun or out-of-bound access in a
privilaged parser can be even more interesting. :)

> Would it be enough to restrict the characters to isalnum()?

I guess.

But current design stores userspace pointer and there's time-of-check vs.
time-of-use problem.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 13:58       ` Kirill A. Shutemov
@ 2020-09-03 15:59         ` Colin Cross
  2020-09-03 17:31           ` Kees Cook
  0 siblings, 1 reply; 15+ messages in thread
From: Colin Cross @ 2020-09-03 15:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Kirill A. Shutemov, Sumit Semwal, Andrew Morton,
	Linux-MM, lkml, Alexey Dobriyan, Jonathan Corbet,
	Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Alexey Gladkov,
	Jason Gunthorpe, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 3, 2020 at 6:58 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Thu, Sep 03, 2020 at 02:43:40PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 03, 2020 at 04:25:37PM +0300, Kirill A. Shutemov wrote:
> > > IIUC, it gives userspace direct control of content of /proc/$PID/maps and
> > > /proc/$PID/smaps. There's no verification of the given string whatsoever.
> > > I'm sure security experts would find clever usage of the feature :P
> >
> > What, you think that naming a VMA
> > "\n55bc3e0f9000-55bc3e0fb000 r--p 00000000 fd:01 16777285                   /bin/cat" might cause problems?

The data is wrapped inside "[anon: ]", which limits the ability to
masquerade as a real file.

> Something that would cause buffer overrun or out-of-bound access in a
> privilaged parser can be even more interesting. :)

This is the same as /proc/pid/cmdline, which has no sanitization.
It's also limited to 255 bytes, which should hopefully limit the
opportunity for a buffer overrun.

> > Would it be enough to restrict the characters to isalnum()?
>
> I guess.
>
> But current design stores userspace pointer and there's time-of-check vs.
> time-of-use problem.

It copies from userspace into a kernel buffer at read time, any
desired sanitization could easily be added there.

> --
>  Kirill A. Shutemov
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 15:59         ` Colin Cross
@ 2020-09-03 17:31           ` Kees Cook
  2020-09-03 17:54             ` Colin Cross
  0 siblings, 1 reply; 15+ messages in thread
From: Kees Cook @ 2020-09-03 17:31 UTC (permalink / raw)
  To: Colin Cross
  Cc: Kirill A. Shutemov, Matthew Wilcox, Kirill A. Shutemov,
	Sumit Semwal, Andrew Morton, Linux-MM, lkml, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Michal Hocko,
	Alexey Gladkov, Jason Gunthorpe, Michel Lespinasse,
	Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 03, 2020 at 08:59:38AM -0700, Colin Cross wrote:
> On Thu, Sep 3, 2020 at 6:58 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Thu, Sep 03, 2020 at 02:43:40PM +0100, Matthew Wilcox wrote:
> > > On Thu, Sep 03, 2020 at 04:25:37PM +0300, Kirill A. Shutemov wrote:
> > > > IIUC, it gives userspace direct control of content of /proc/$PID/maps and
> > > > /proc/$PID/smaps. There's no verification of the given string whatsoever.
> > > > I'm sure security experts would find clever usage of the feature :P
> > >
> > > What, you think that naming a VMA
> > > "\n55bc3e0f9000-55bc3e0fb000 r--p 00000000 fd:01 16777285                   /bin/cat" might cause problems?
> 
> The data is wrapped inside "[anon: ]", which limits the ability to
> masquerade as a real file.

That's true, but it's insufficient to avoid spoofing parsers (e.g. if I
set my name to "hiding]\nfake-maps-line-here [anon: evil"

> > Something that would cause buffer overrun or out-of-bound access in a
> > privilaged parser can be even more interesting. :)
> 
> This is the same as /proc/pid/cmdline, which has no sanitization.
> It's also limited to 255 bytes, which should hopefully limit the
> opportunity for a buffer overrun.

/proc/$pid/cmdline contains a "single item", in the sense that the
entire field is contained. Confusing parsers is certainly still
possible, but the bounds for it are distinct in that there is nothing
else in that file.

The better analogy is with /proc/$pid/status, which is multi-line like
maps, and *does* perform escaping, e.g.:

$ cat sneaky.c
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char * const args[] = {
                "four\nfive\nsix",
                NULL,
        };
        return execv("./one\ntwo\nthree", args);
}
$ head -n1 /proc/$pid/status
Name:   one\ntwo\nthree
$ cat /proc/$pid/cmdline
four
five
six

> > > Would it be enough to restrict the characters to isalnum()?
> >
> > I guess.
> >
> > But current design stores userspace pointer and there's time-of-check vs.
> > time-of-use problem.
> 
> It copies from userspace into a kernel buffer at read time, any
> desired sanitization could easily be added there.

I would prefer having strict validation of the input over escaping the
output, so to that end how about making close to "variable name" sane:
[-\.a-zA-Z0-9_ ] ?

if it should be wider than that, how about printable minus \n \r \f \v [ ] ?

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 17:31           ` Kees Cook
@ 2020-09-03 17:54             ` Colin Cross
  2020-09-03 18:00               ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Colin Cross @ 2020-09-03 17:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: Kirill A. Shutemov, Matthew Wilcox, Kirill A. Shutemov,
	Sumit Semwal, Andrew Morton, Linux-MM, lkml, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Michal Hocko,
	Alexey Gladkov, Jason Gunthorpe, Michel Lespinasse,
	Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 3, 2020 at 10:31 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Sep 03, 2020 at 08:59:38AM -0700, Colin Cross wrote:
> > On Thu, Sep 3, 2020 at 6:58 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Thu, Sep 03, 2020 at 02:43:40PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Sep 03, 2020 at 04:25:37PM +0300, Kirill A. Shutemov wrote:
> > > > > IIUC, it gives userspace direct control of content of /proc/$PID/maps and
> > > > > /proc/$PID/smaps. There's no verification of the given string whatsoever.
> > > > > I'm sure security experts would find clever usage of the feature :P
> > > >
> > > > What, you think that naming a VMA
> > > > "\n55bc3e0f9000-55bc3e0fb000 r--p 00000000 fd:01 16777285                   /bin/cat" might cause problems?
> >
> > The data is wrapped inside "[anon: ]", which limits the ability to
> > masquerade as a real file.
>
> That's true, but it's insufficient to avoid spoofing parsers (e.g. if I
> set my name to "hiding]\nfake-maps-line-here [anon: evil"
>
> > > Something that would cause buffer overrun or out-of-bound access in a
> > > privilaged parser can be even more interesting. :)
> >
> > This is the same as /proc/pid/cmdline, which has no sanitization.
> > It's also limited to 255 bytes, which should hopefully limit the
> > opportunity for a buffer overrun.
>
> /proc/$pid/cmdline contains a "single item", in the sense that the
> entire field is contained. Confusing parsers is certainly still
> possible, but the bounds for it are distinct in that there is nothing
> else in that file.
>
> The better analogy is with /proc/$pid/status, which is multi-line like
> maps, and *does* perform escaping, e.g.:
>
> $ cat sneaky.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
>         char * const args[] = {
>                 "four\nfive\nsix",
>                 NULL,
>         };
>         return execv("./one\ntwo\nthree", args);
> }
> $ head -n1 /proc/$pid/status
> Name:   one\ntwo\nthree
> $ cat /proc/$pid/cmdline
> four
> five
> six
>
> > > > Would it be enough to restrict the characters to isalnum()?
> > >
> > > I guess.
> > >
> > > But current design stores userspace pointer and there's time-of-check vs.
> > > time-of-use problem.
> >
> > It copies from userspace into a kernel buffer at read time, any
> > desired sanitization could easily be added there.
>
> I would prefer having strict validation of the input over escaping the
> output, so to that end how about making close to "variable name" sane:
> [-\.a-zA-Z0-9_ ] ?

A quick skim of existing Android cases shows at least ":()" as well.
I'm not sure what you mean by validation of the input - the input to
the prctl is a userspace pointer, which is stored in the kernel for
later reads.  Storing the string in the kernel at prctl time would be
infeasible.  The kernel can only validate the value when producing
/proc/pid/maps.  It could replace disallowed characters with _ though.

> if it should be wider than that, how about printable minus \n \r \f \v [ ] ?

That would work fine for Android.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 17:54             ` Colin Cross
@ 2020-09-03 18:00               ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-09-03 18:00 UTC (permalink / raw)
  To: Colin Cross, Kees Cook
  Cc: Kirill A. Shutemov, Matthew Wilcox, Kirill A. Shutemov,
	Sumit Semwal, Andrew Morton, Linux-MM, lkml, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Michal Hocko,
	Alexey Gladkov, Jason Gunthorpe, Michel Lespinasse,
	Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Peter Zijlstra, Ingo Molnar,
	Oleg Nesterov, Eric W. Biederman, Jan Glauber, Rob Landley,
	Cyrill Gorcunov, Serge E. Hallyn, David Rientjes, Hugh Dickins,
	Rik van Riel, Mel Gorman, Tang Chen, Robin Holt, Shaohua Li,
	Sasha Levin, Johannes Weiner, Minchan Kim

On 9/3/20 10:54 AM, Colin Cross wrote:
> On Thu, Sep 3, 2020 at 10:31 AM Kees Cook <keescook@chromium.org> wrote:
>> I would prefer having strict validation of the input over escaping the
>> output, so to that end how about making close to "variable name" sane:
>> [-\.a-zA-Z0-9_ ] ?
> 
> A quick skim of existing Android cases shows at least ":()" as well.
> I'm not sure what you mean by validation of the input - the input to
> the prctl is a userspace pointer, which is stored in the kernel for
> later reads.  Storing the string in the kernel at prctl time would be
> infeasible.  The kernel can only validate the value when producing
> /proc/pid/maps.  It could replace disallowed characters with _ though.
> 
>> if it should be wider than that, how about printable minus \n \r \f \v [ ] ?
> 
> That would work fine for Android.

Looks like /proc/$pid/maps and smaps filter out \n (and probably more)
for filenames.  I created a file named "foo\nbar" then mmap'd it.
Here's what I got:

> 01324000-7f1854634000 rw-s 00000000 00:30 9308668                        /home/dave/foo\012bar

Can we leverage what we already do for filenames?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
  2020-09-03 13:25   ` Kirill A. Shutemov
@ 2020-09-03 18:00   ` Kees Cook
  2020-09-03 18:09     ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Kees Cook @ 2020-09-03 18:00 UTC (permalink / raw)
  To: Sumit Semwal
  Cc: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Michal Hocko,
	Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra,
	Ingo Molnar, Oleg Nesterov, Eric W. Biederman, Jan Glauber,
	Rob Landley, Cyrill Gorcunov, Serge E. Hallyn, David Rientjes,
	Hugh Dickins, Rik van Riel, Mel Gorman, Tang Chen, Robin Holt,
	Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim

On Tue, Sep 01, 2020 at 09:44:59PM +0530, Sumit Semwal wrote:
> From: Colin Cross <ccross@google.com>
> 
> In many userspace applications, and especially in VM based applications
> like Android uses heavily, there are multiple different allocators in use.
>  At a minimum there is libc malloc and the stack, and in many cases there
> are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> multiple VM heaps (one for small objects, one for big objects, etc.).
> Each of these layers usually has its own tools to inspect its usage;
> malloc by compiling a debug version, the VM through heap inspection tools,
> and for direct syscalls there is usually no way to track them.
> 
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
> 
> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> userspace-provided name for anonymous vmas.  The names of named anonymous
> vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> 
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> Setting the name to NULL clears it.
> 
> The name is stored in a user pointer in the shared union in vm_area_struct
> that points to a null terminated string inside the user process.  vmas
> that point to the same address and are otherwise mergeable will be merged,
> but vmas that point to equivalent strings at different addresses will not
> be merged.
> 
> The idea to store a userspace pointer to reduce the complexity within mm
> (at the expense of the complexity of reading /proc/pid/mem) came from Dave
> Hansen.  This results in no runtime overhead in the mm subsystem other
> than comparing the anon_name pointers when considering vma merging.  The
> pointer is stored in a union with fields that are only used on file-backed
> mappings, so it does not increase memory usage.
> (Upstream changed to remove the union, so this patch adds it back as well)

This is, for me, the weirdest part of the series. Very little in the
kernel does stuff like this, and given the Fun(tm) we've had with things
like userfaultfd, I cannot begin to imagine what it'd be nice to have
this kind of control from a process over what is reading its maps file.
(i.e. stalling a maps reader by installing a userfaultfd for the
anon_name: that'd be a DoS for anything trying to read the maps file,
etc.)

Why is a kernel-copied string insufficient for this? I don't think VMA
merging is a fast-path operation, so doing a strcmp isn't going to wreck
anything...

Let me try to find the earlier thread with Dave Hansen... okay, found it:
https://lore.kernel.org/linux-mm/51DDF071.5000309@intel.com/

Right, so, this idea predates userfaultfd. :)

More notes below, but I *really* think this should not be a userspace
pointer. And since a separate union has been found, let's just do a
strndup_user() for the name, validate it as containing only printable
characters without \n \r \v \f and move the merging logic into a
separate patch.

> 
> Signed-off-by: Colin Cross <ccross@google.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Jan Glauber <jan.glauber@gmail.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Rob Landley <rob@landley.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: Robin Holt <holt@sgi.com>
> Cc: Shaohua Li <shli@fusionio.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
> 
> ---
> 
> v2: updates the commit message to explain in more detail why the
>     patch is useful.
> v3: renames vma_get_anon_name to vma_anon_name
>     replaces logic in seq_print_vma_name with access_process_vm
>     removes Name: entry from smaps, it's already on the header line
>     changes the prctl option number to match what is currently in
>        use on Android
> v4: adds paragraph to commit log on why this is better than tracking
>        in userspace
>     squashes fixes from Andrew Morton to fix build error and warning
>     fix build error reported by Mark Salter when !CONFIG_MMU
> v5: rebased to v5.9-rc1, added minor fixes to match upstream
> v6: rebased to v5.9-rc3, and addressed review comments:
>        - added missing callers in fs/userfaultd.c
>        - simplified the union
>        - use the new access_remote_vm_locked() in show_map_vma()
>          since that already holds mmap_lock
> v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't
>       defined

(In the future, can you link to lore URLs for the earlier versions? It
makes it easier to do research on past threads...)

> ---
>  Documentation/filesystems/proc.rst |  2 ++
>  fs/proc/task_mmu.c                 | 24 ++++++++++++-
>  fs/userfaultfd.c                   |  7 ++--
>  include/linux/mm.h                 | 12 ++++++-
>  include/linux/mm_types.h           | 25 +++++++++++---
>  include/uapi/linux/prctl.h         |  3 ++
>  kernel/sys.c                       | 32 ++++++++++++++++++
>  mm/interval_tree.c                 |  2 +-
>  mm/madvise.c                       | 54 +++++++++++++++++++++++++++---
>  mm/mempolicy.c                     |  3 +-
>  mm/mlock.c                         |  2 +-
>  mm/mmap.c                          | 38 ++++++++++++---------
>  mm/mprotect.c                      |  2 +-
>  13 files changed, 173 insertions(+), 33 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 533c79e8d2cd..41a9cea73b8b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -429,6 +429,8 @@ is not associated with a file:
>   [stack]                    the stack of the main process
>   [vdso]                     the "virtual dynamic shared object",
>                              the kernel system call handler
> +[anon:<name>]               an anonymous mapping that has been

typo: needs a single leading space here.

> +                            named by userspace
>   =======                    ====================================
>  
>   or if empty, the mapping is anonymous.
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5066b0251ed8..290ce5ecad0d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -97,6 +97,21 @@ unsigned long task_statm(struct mm_struct *mm,
>  	return mm->total_vm;
>  }
>  
> +static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	char anon_name[NAME_MAX + 1];
> +	int n;
> +
> +	n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma), anon_name,
> +				    NAME_MAX, 0);
> +	if (n > 0) {
> +		seq_puts(m, "[anon:");
> +		seq_write(m, anon_name, strnlen(anon_name, n));
> +		seq_putc(m, ']');

seq_printf(m, "[anon:%s]", anon_name); ?

> +	}
> +}
> [...]
> +#ifdef CONFIG_ADVISE_SYSCALLS
> +int madvise_set_anon_name(unsigned long start, unsigned long len_in,
> +			  unsigned long name_addr);
> +#else
> +static inline int madvise_set_anon_name(unsigned long start, unsigned long len_in,
> +					unsigned long name_addr) {
> +	return 0;

Why not -EINVAL?

> +}
> +#endif
> [...]
> +static int madvise_vma_anon_name(struct vm_area_struct *vma,
> +				 struct vm_area_struct **prev,
> +				 unsigned long start, unsigned long end,
> +				 unsigned long name_addr)
> +{
> +	int error;
> +
> +	/* Only anonymous mappings can be named */
> +	if (vma->vm_file)
> +		return -EINVAL;

But a process could fork, and then change the contents of the vma_name
address in one process, showing different names for the same VMA in
maps? O_o That feels like an unexpected result -- the VMA should have
the same name in either process.

> [...]
>  static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  				struct file *file, unsigned long vm_flags,
> -				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +				struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +				const char __user *anon_name)

I think at the very least, this patch should be split in half: first add
the feature, and then add the ability to merge.

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 18:00   ` Kees Cook
@ 2020-09-03 18:09     ` Dave Hansen
  2020-09-03 18:26       ` Colin Cross
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2020-09-03 18:09 UTC (permalink / raw)
  To: Kees Cook, Sumit Semwal
  Cc: Andrew Morton, linux-mm, linux-kernel, Alexey Dobriyan,
	Jonathan Corbet, Mauro Carvalho Chehab, Michal Hocko,
	Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Peter Zijlstra, Ingo Molnar,
	Oleg Nesterov, Eric W. Biederman, Jan Glauber, Rob Landley,
	Cyrill Gorcunov, Serge E. Hallyn, David Rientjes, Hugh Dickins,
	Rik van Riel, Mel Gorman, Tang Chen, Robin Holt, Shaohua Li,
	Sasha Levin, Johannes Weiner, Minchan Kim

On 9/3/20 11:00 AM, Kees Cook wrote:
> Why is a kernel-copied string insufficient for this? I don't think VMA
> merging is a fast-path operation, so doing a strcmp isn't going to wreck
> anything...
> 
> Let me try to find the earlier thread with Dave Hansen... okay, found it:
> https://lore.kernel.org/linux-mm/51DDF071.5000309@intel.com/
> 
> Right, so, this idea predates userfaultfd. :)
> 
> More notes below, but I *really* think this should not be a userspace
> pointer. And since a separate union has been found, let's just do a
> strndup_user() for the name, validate it as containing only printable
> characters without \n \r \v \f and move the merging logic into a
> separate patch.

FWIW, I don't have any objections to this.

Refcounting strings was what I think I had the strongest reaction to
back in the good old days of 2013.  strdup() on split plus strcmp() on
merge doesn't sound afwul to me, and it is darn straightforward.  The
biggest downside is probably kernel memory consumption.  We should
probably just think through whether having so many duplicates changes
things materially.

For instance, should/could we penalize a task's vm.max_map_count when
it's using this mechanism?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 18:09     ` Dave Hansen
@ 2020-09-03 18:26       ` Colin Cross
  2020-09-03 18:40         ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Colin Cross @ 2020-09-03 18:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Sumit Semwal, Andrew Morton, Linux-MM, lkml,
	Alexey Dobriyan, Jonathan Corbet, Mauro Carvalho Chehab,
	Michal Hocko, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Peter Zijlstra, Ingo Molnar,
	Oleg Nesterov, Eric W. Biederman, Jan Glauber, Rob Landley,
	Cyrill Gorcunov, Serge E. Hallyn, David Rientjes, Hugh Dickins,
	Rik van Riel, Mel Gorman, Tang Chen, Robin Holt, Shaohua Li,
	Sasha Levin, Johannes Weiner, Minchan Kim

On Thu, Sep 3, 2020 at 11:09 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/3/20 11:00 AM, Kees Cook wrote:
> > Why is a kernel-copied string insufficient for this? I don't think VMA
> > merging is a fast-path operation, so doing a strcmp isn't going to wreck
> > anything...
> >
> > Let me try to find the earlier thread with Dave Hansen... okay, found it:
> > https://lore.kernel.org/linux-mm/51DDF071.5000309@intel.com/
> >
> > Right, so, this idea predates userfaultfd. :)
> >
> > More notes below, but I *really* think this should not be a userspace
> > pointer. And since a separate union has been found, let's just do a
> > strndup_user() for the name, validate it as containing only printable
> > characters without \n \r \v \f and move the merging logic into a
> > separate patch.
>
> FWIW, I don't have any objections to this.
>
> Refcounting strings was what I think I had the strongest reaction to
> back in the good old days of 2013.  strdup() on split plus strcmp() on
> merge doesn't sound afwul to me, and it is darn straightforward.  The
> biggest downside is probably kernel memory consumption.  We should
> probably just think through whether having so many duplicates changes
> things materially.
>
> For instance, should/could we penalize a task's vm.max_map_count when
> it's using this mechanism?

Just to provide some concrete numbers, the ART process I examined
(https://pastebin.com/YNUTvZyz) had 280 named anonymous mappings using
a total of 6566 bytes for the names.  There were only 63 unique names,
using 1925 bytes.  On my personal usage device, there are currently a
total of 59769 named anonymous devices across all processes using
1224119 bytes, 5843 of them unique using 121754 bytes.  The vast
majority of the unique names are of the form "stack_and_tls:999",
which are dynamically allocated in the userspace process' heap.  There
are only 132 names that do not contain stack_and_tls using 9540 bytes,
repeated 49938 times using 1030651 bytes (108x).  Most of those are
constant strings, meaning the pointer is into the .rodata section of a
file mapping that is shared between all processes.

Is fork a concern?  It would have to strdup every name.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
  2020-09-03 18:26       ` Colin Cross
@ 2020-09-03 18:40         ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-09-03 18:40 UTC (permalink / raw)
  To: Colin Cross
  Cc: Kees Cook, Sumit Semwal, Andrew Morton, Linux-MM, lkml,
	Alexey Dobriyan, Jonathan Corbet, Mauro Carvalho Chehab,
	Michal Hocko, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
	Kirill A . Shutemov, Michel Lespinasse, Michal Koutný,
	Song Liu, Huang Ying, Vlastimil Babka, Yang Shi, chenqiwu,
	Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche,
	Amit Pundir, Thomas Gleixner, Christian Brauner, Daniel Jordan,
	Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel,
	John Stultz, Pekka Enberg, Peter Zijlstra, Ingo Molnar,
	Oleg Nesterov, Eric W. Biederman, Jan Glauber, Rob Landley,
	Cyrill Gorcunov, David Rientjes, Hugh Dickins, Rik van Riel,
	Mel Gorman, Robin Holt, Shaohua Li, Johannes Weiner, Minchan Kim,
	Sasha Levin, Serge E. Hallyn

On 9/3/20 11:26 AM, Colin Cross wrote:
>> FWIW, I don't have any objections to this.
>>
>> Refcounting strings was what I think I had the strongest reaction to
>> back in the good old days of 2013.  strdup() on split plus strcmp() on
>> merge doesn't sound afwul to me, and it is darn straightforward.  The
>> biggest downside is probably kernel memory consumption.  We should
>> probably just think through whether having so many duplicates changes
>> things materially.
>>
>> For instance, should/could we penalize a task's vm.max_map_count when
>> it's using this mechanism?
> Just to provide some concrete numbers, the ART process I examined
> (https://pastebin.com/YNUTvZyz) had 280 named anonymous mappings using
> a total of 6566 bytes for the names.  There were only 63 unique names,
> using 1925 bytes.  On my personal usage device, there are currently a
> total of 59769 named anonymous devices across all processes using
> 1224119 bytes, 5843 of them unique using 121754 bytes.  The vast
> majority of the unique names are of the form "stack_and_tls:999",
> which are dynamically allocated in the userspace process' heap.  There
> are only 132 names that do not contain stack_and_tls using 9540 bytes,
> repeated 49938 times using 1030651 bytes (108x).  Most of those are
> constant strings, meaning the pointer is into the .rodata section of a
> file mapping that is shared between all processes.

Thanks for the data!  That seems like totally reasonable memory
consumption in the normal cases.

I'm mostly concerned about the worst-case kernel memory consumption.
It's a DoS if there's a big change to the amount of kernel memory a
process can consume.  For instance, each VMA is 192 bytes.  If the name
limit was, say, 4k, it makes the per-VMA kernel memory consumption go up
by a factor of 20, which isn't nice.  But, if it's, say 16 bytes, it's
only a 10% increase over what each VMA consumes today.

> Is fork a concern?  It would have to strdup every name.

Should be pretty easy to confirm the worst case experimentally.  Create
a process with ~64k VMAs, time the fork()s.  Then do the same with 64k
VMAs with the longest names set on each VMA.  If you can't time a
difference, you'll have a strong argument that someone with 10 of these
things in a process won't notice.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-09-03 18:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-01 16:14 [PATCH v7 0/3] Anonymous VMA naming patches Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
2020-09-03 13:25   ` Kirill A. Shutemov
2020-09-03 13:43     ` Matthew Wilcox
2020-09-03 13:58       ` Kirill A. Shutemov
2020-09-03 15:59         ` Colin Cross
2020-09-03 17:31           ` Kees Cook
2020-09-03 17:54             ` Colin Cross
2020-09-03 18:00               ` Dave Hansen
2020-09-03 18:00   ` Kees Cook
2020-09-03 18:09     ` Dave Hansen
2020-09-03 18:26       ` Colin Cross
2020-09-03 18:40         ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).