linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse
@ 2021-09-02 23:18 Suren Baghdasaryan
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-02 23:18 UTC (permalink / raw)
  To: akpm
  Cc: ccross, sumit.semwal, mhocko, dave.hansen, keescook, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team, surenb,
	Pekka Enberg, Ingo Molnar, Oleg Nesterov, Jan Glauber,
	John Stultz, Rob Landley, Cyrill Gorcunov, Serge E. Hallyn,
	David Rientjes, Mel Gorman, Shaohua Li, Minchan Kim

From: Colin Cross <ccross@google.com>

Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.

Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter.  The only caller for now is
sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
the next patch will add an additional caller.
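
As an illustration only, the walk-with-callback shape described above can be
modelled in userspace. The sketch below is not kernel code: struct toy_vma,
toy_walk_vmas and toy_count_bytes are hypothetical names, and a sorted array
stands in for the vma list. It mirrors the behaviour of visiting each overlap
and reporting -ENOMEM when part of the range is unmapped.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: a half-open range [start, end). */
struct toy_vma {
	unsigned long start;
	unsigned long end;
};

typedef int (*toy_visit_fn)(const struct toy_vma *vma, unsigned long start,
			    unsigned long end, unsigned long arg);

/*
 * Walk the sorted, non-overlapping array 'vmas' over [start, end) and call
 * 'visit' on the overlap with each entry.  Returns the first visit error,
 * -ENOMEM if any part of the range was unmapped, or 0.
 */
int toy_walk_vmas(const struct toy_vma *vmas, size_t n, unsigned long start,
		  unsigned long end, unsigned long arg, toy_visit_fn visit)
{
	int unmapped_error = 0;
	size_t i;

	assert(visit != NULL);

	for (i = 0; i < n && start < end; i++) {
		const struct toy_vma *vma = &vmas[i];
		unsigned long tmp;
		int error;

		if (vma->end <= start)
			continue;	/* entirely below the range */
		if (start < vma->start) {
			unmapped_error = -ENOMEM;	/* hole before this vma */
			start = vma->start;
			if (start >= end)
				break;
		}
		tmp = vma->end < end ? vma->end : end;
		error = visit(vma, start, tmp, arg);
		if (error)
			return error;
		start = tmp;
	}
	if (start < end)
		return -ENOMEM;	/* ran out of vmas before reaching 'end' */
	return unmapped_error;
}

/* Example visitor: adds the visited length to the counter passed via arg. */
int toy_count_bytes(const struct toy_vma *vma, unsigned long start,
		    unsigned long end, unsigned long arg)
{
	(void)vma;
	*(unsigned long *)arg += end - start;
	return 0;
}
```

Passing the per-call argument as an unsigned long follows the shape of the
visit callback in this patch, so one walker can serve callers that pass
either an integer behavior or a pointer.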

Move the handling of all vma behaviors into madvise_behavior, and rename it
to madvise_vma_behavior.

Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma.  The next patch will add support for updating a new
anon_name field as well.

Signed-off-by: Colin Cross <ccross@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Rob Landley <rob@landley.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  [sumits: rebased over v5.9-rc3]
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
  [surenb: rebased over v5.14-rc7]
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
previous version including cover letter with test results is at:
https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/

changes in v9
- Removed unnecessary initialization of 'error' to 0 in madvise_vma_behavior,
per Cyrill Gorcunov
- Replaced goto's with returns in madvise_vma_behavior, per Cyrill Gorcunov
- Recovered the comment explaining why we map ENOMEM to EAGAIN in
madvise_vma_behavior, per Cyrill Gorcunov

 mm/madvise.c | 317 +++++++++++++++++++++++++++------------------------
 1 file changed, 170 insertions(+), 147 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 56324a3dbc4e..54bf9f73f95d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -63,76 +63,20 @@ static int madvise_need_mmap_write(int behavior)
 }
 
 /*
- * We can potentially split a vm area into separate
- * areas, each area with its own behavior.
+ * Update the vm_flags on regiion of a vma, splitting it or merging it as
+ * necessary.  Must be called with mmap_sem held for writing;
  */
-static long madvise_behavior(struct vm_area_struct *vma,
-		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end, int behavior)
+static int madvise_update_vma(struct vm_area_struct *vma,
+			      struct vm_area_struct **prev, unsigned long start,
+			      unsigned long end, unsigned long new_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int error = 0;
+	int error;
 	pgoff_t pgoff;
-	unsigned long new_flags = vma->vm_flags;
-
-	switch (behavior) {
-	case MADV_NORMAL:
-		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
-		break;
-	case MADV_SEQUENTIAL:
-		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
-		break;
-	case MADV_RANDOM:
-		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
-		break;
-	case MADV_DONTFORK:
-		new_flags |= VM_DONTCOPY;
-		break;
-	case MADV_DOFORK:
-		if (vma->vm_flags & VM_IO) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTCOPY;
-		break;
-	case MADV_WIPEONFORK:
-		/* MADV_WIPEONFORK is only supported on anonymous memory. */
-		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags |= VM_WIPEONFORK;
-		break;
-	case MADV_KEEPONFORK:
-		new_flags &= ~VM_WIPEONFORK;
-		break;
-	case MADV_DONTDUMP:
-		new_flags |= VM_DONTDUMP;
-		break;
-	case MADV_DODUMP:
-		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTDUMP;
-		break;
-	case MADV_MERGEABLE:
-	case MADV_UNMERGEABLE:
-		error = ksm_madvise(vma, start, end, behavior, &new_flags);
-		if (error)
-			goto out_convert_errno;
-		break;
-	case MADV_HUGEPAGE:
-	case MADV_NOHUGEPAGE:
-		error = hugepage_madvise(vma, &new_flags, behavior);
-		if (error)
-			goto out_convert_errno;
-		break;
-	}
 
 	if (new_flags == vma->vm_flags) {
 		*prev = vma;
-		goto out;
+		return 0;
 	}
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -149,21 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	if (start != vma->vm_start) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, start, 1);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 	if (end != vma->vm_end) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, end, 0);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 success:
@@ -172,15 +116,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	 */
 	vma->vm_flags = new_flags;
 
-out_convert_errno:
-	/*
-	 * madvise() returns EAGAIN if kernel resources, such as
-	 * slab, are temporarily unavailable.
-	 */
-	if (error == -ENOMEM)
-		error = -EAGAIN;
-out:
-	return error;
+	return 0;
 }
 
 #ifdef CONFIG_SWAP
@@ -930,6 +866,94 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+/*
+ * Apply an madvise behavior to a region of a vma.  madvise_update_vma
+ * will handle splitting a vm area into separate areas, each area with its own
+ * behavior.
+ */
+static int madvise_vma_behavior(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end,
+				unsigned long behavior)
+{
+	int error;
+	unsigned long new_flags = vma->vm_flags;
+
+	switch (behavior) {
+	case MADV_REMOVE:
+		return madvise_remove(vma, prev, start, end);
+	case MADV_WILLNEED:
+		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
+	case MADV_FREE:
+	case MADV_DONTNEED:
+		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+		return madvise_populate(vma, prev, start, end, behavior);
+	case MADV_NORMAL:
+		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+		break;
+	case MADV_SEQUENTIAL:
+		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
+		break;
+	case MADV_RANDOM:
+		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+		break;
+	case MADV_DONTFORK:
+		new_flags |= VM_DONTCOPY;
+		break;
+	case MADV_DOFORK:
+		if (vma->vm_flags & VM_IO)
+			return -EINVAL;
+		new_flags &= ~VM_DONTCOPY;
+		break;
+	case MADV_WIPEONFORK:
+		/* MADV_WIPEONFORK is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED)
+			return -EINVAL;
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
+	case MADV_DONTDUMP:
+		new_flags |= VM_DONTDUMP;
+		break;
+	case MADV_DODUMP:
+		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
+			return -EINVAL;
+		new_flags &= ~VM_DONTDUMP;
+		break;
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+		error = ksm_madvise(vma, start, end, behavior, &new_flags);
+		if (error)
+			goto out;
+		break;
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		error = hugepage_madvise(vma, &new_flags, behavior);
+		if (error)
+			goto out;
+		break;
+	}
+
+	error = madvise_update_vma(vma, prev, start, end, new_flags);
+
+out:
+	/*
+	 * madvise() returns EAGAIN if kernel resources, such as
+	 * slab, are temporarily unavailable.
+	 */
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -978,30 +1002,6 @@ static int madvise_inject_error(int behavior,
 }
 #endif
 
-static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
-{
-	switch (behavior) {
-	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
-	case MADV_WILLNEED:
-		return madvise_willneed(vma, prev, start, end);
-	case MADV_COLD:
-		return madvise_cold(vma, prev, start, end);
-	case MADV_PAGEOUT:
-		return madvise_pageout(vma, prev, start, end);
-	case MADV_FREE:
-	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
-	case MADV_POPULATE_READ:
-	case MADV_POPULATE_WRITE:
-		return madvise_populate(vma, prev, start, end, behavior);
-	default:
-		return madvise_behavior(vma, prev, start, end, behavior);
-	}
-}
-
 static bool
 madvise_behavior_valid(int behavior)
 {
@@ -1054,6 +1054,73 @@ process_madvise_behavior_valid(int behavior)
 	}
 }
 
+/*
+ * Walk the vmas in range [start,end), and call the visit function on each one.
+ * The visit function will get start and end parameters that cover the overlap
+ * between the current vma and the original range.  Any unmapped regions in the
+ * original range will result in this function returning -ENOMEM while still
+ * calling the visit function on all of the existing vmas in the range.
+ * Must be called with the mmap_lock held for reading or writing.
+ */
+static
+int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
+		      unsigned long end, unsigned long arg,
+		      int (*visit)(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev, unsigned long start,
+				   unsigned long end, unsigned long arg))
+{
+	struct vm_area_struct *vma;
+	struct vm_area_struct *prev;
+	unsigned long tmp;
+	int unmapped_error = 0;
+
+	/*
+	 * If the interval [start,end) covers some unmapped address
+	 * ranges, just ignore them, but return -ENOMEM at the end.
+	 * - different from the way of handling in mlock etc.
+	 */
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		int error;
+
+		/* Still start < end. */
+		if (!vma)
+			return -ENOMEM;
+
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			unmapped_error = -ENOMEM;
+			start = vma->vm_start;
+			if (start >= end)
+				break;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = visit(vma, &prev, start, tmp, arg);
+		if (error)
+			return error;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_lock */
+			vma = find_vma(mm, start);
+	}
+
+	return unmapped_error;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1126,9 +1193,7 @@ process_madvise_behavior_valid(int behavior)
  */
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
-	unsigned long end, tmp;
-	struct vm_area_struct *vma, *prev;
-	int unmapped_error = 0;
+	unsigned long end;
 	int error = -EINVAL;
 	int write;
 	size_t len;
@@ -1168,51 +1233,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 		mmap_read_lock(mm);
 	}
 
-	/*
-	 * If the interval [start,end) covers some unmapped address
-	 * ranges, just ignore them, but return -ENOMEM at the end.
-	 * - different from the way of handling in mlock etc.
-	 */
-	vma = find_vma_prev(mm, start, &prev);
-	if (vma && start > vma->vm_start)
-		prev = vma;
-
 	blk_start_plug(&plug);
-	for (;;) {
-		/* Still start < end. */
-		error = -ENOMEM;
-		if (!vma)
-			goto out;
-
-		/* Here start < (end|vma->vm_end). */
-		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
-			start = vma->vm_start;
-			if (start >= end)
-				goto out;
-		}
-
-		/* Here vma->vm_start <= start < (end|vma->vm_end) */
-		tmp = vma->vm_end;
-		if (end < tmp)
-			tmp = end;
-
-		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
-		if (error)
-			goto out;
-		start = tmp;
-		if (prev && start < prev->vm_end)
-			start = prev->vm_end;
-		error = unmapped_error;
-		if (start >= end)
-			goto out;
-		if (prev)
-			vma = prev->vm_next;
-		else	/* madvise_remove dropped mmap_lock */
-			vma = find_vma(mm, start);
-	}
-out:
+	error = madvise_walk_vmas(mm, start, end, behavior,
+			madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	if (write)
 		mmap_write_unlock(mm);
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-02 23:18 [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
@ 2021-09-02 23:18 ` Suren Baghdasaryan
  2021-09-03 21:35   ` Kees Cook
                     ` (3 more replies)
  2021-09-02 23:18 ` [PATCH v9 3/3] mm: add anonymous vma name refcounting Suren Baghdasaryan
  2021-09-03  0:28 ` [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
  2 siblings, 4 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-02 23:18 UTC (permalink / raw)
  To: akpm
  Cc: ccross, sumit.semwal, mhocko, dave.hansen, keescook, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team, surenb

From: Colin Cross <ccross@google.com>

Many userspace applications, and especially VM-based applications like the
ones Android uses heavily, have multiple different allocators in use.
At a minimum there are libc malloc and the stack, and in many cases there
are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
multiple VM heaps (one for small objects, one for big objects, etc.).
Each of these layers usually has its own tools to inspect its usage:
malloc by compiling a debug version, the VM through heap inspection tools,
while for direct syscalls there is usually no way to track them.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
in userspace and slice their usage by process, shared (COW) vs. unique
mappings, backing, etc.  This can account for real physical memory usage
even in cases like fork without exec (which Android uses heavily to share
as many private COW pages as possible between processes), Kernel SamePage
Merging, and clean zero pages.  It produces a measurement of the pages
that only exist in that process (USS, for unique), and a measurement of
the physical memory usage of that process with the cost of shared pages
being evenly split between processes that share them (PSS).
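
To make the two measurements concrete, here is a small worked sketch of the
arithmetic. It assumes a hypothetical input, an array giving for each page of
a process the number of processes that map it; struct toy_usage and
toy_account are illustrative names, not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>

struct toy_usage {
	double uss_pages;	/* pages mapped only by this process */
	double pss_pages;	/* each page counted as 1/mapcount */
};

/*
 * mapcount[i] is the number of processes mapping page i of this process,
 * and must be at least 1 (the process itself).
 */
struct toy_usage toy_account(const unsigned int *mapcount, size_t npages)
{
	struct toy_usage u = { 0.0, 0.0 };
	size_t i;

	for (i = 0; i < npages; i++) {
		assert(mapcount[i] >= 1);
		if (mapcount[i] == 1)
			u.uss_pages += 1.0;
		u.pss_pages += 1.0 / mapcount[i];
	}
	return u;
}
```

For four pages with map counts 1, 1, 2 and 4, this gives a USS of 2 pages and
a PSS of 1 + 1 + 1/2 + 1/4 = 2.75 pages.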

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap walking
tool that can understand the heap debugging of every layer, or for every
layer's heap debugging tools to implement the pagemap walking logic, in
which case it is hard to get a consistent view of memory across the whole
system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that
somebody needs to clean up on crashes.  It needs to be readable while
the process is still running, so it has to have some sort of
synchronization with every layer of userspace.  Efficiently tracking
the ranges requires reimplementing something like the kernel vma
trees, and linking to it from every layer of userspace.  It requires
more memory, more syscalls, more runtime cost, and more complexity to
separately track regions that the kernel is already tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas.  The names of named anonymous
vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
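
A tool reading these files could pick the name out of a line roughly as
follows. This is a sketch only: toy_parse_anon_name is a hypothetical helper,
and it assumes the first "[anon:" on the line starts the name field, which
would be wrong for a file-backed mapping whose path happens to contain that
text.

```c
#include <assert.h>
#include <string.h>

/*
 * If 'line' (one line of /proc/pid/maps) contains "[anon:<name>]", copy
 * <name> into 'out' and return its length; otherwise return -1.
 */
int toy_parse_anon_name(const char *line, char *out, size_t outsz)
{
	const char *p, *q;
	size_t len;

	assert(line != NULL && out != NULL);

	p = strstr(line, "[anon:");
	if (!p)
		return -1;
	p += strlen("[anon:");
	/* Names cannot contain ']', so the first one closes the field. */
	q = strchr(p, ']');
	if (!q)
		return -1;
	len = (size_t)(q - p);
	if (len >= outsz)
		return -1;
	memcpy(out, p, len);
	out[len] = '\0';
	return (int)len;
}
```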

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it. The name is limited to 256 bytes,
including the NUL terminator, and is checked to contain only printable ASCII
characters (including space), except '[', '\' and ']'.
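
From userspace the call could be wrapped roughly as below. This is a sketch
under stated assumptions: the two constants are copied from the
include/uapi/linux/prctl.h hunk of this patch and are only defined here if
the installed headers lack them, and toy_anon_name_ok and toy_name_region are
hypothetical helper names. The validator mirrors the rules described above so
that a caller can reject a bad name before making the syscall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Values taken from the prctl.h hunk of this patch. */
#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME	0
#endif

#define TOY_ANON_NAME_MAX	256	/* including the NUL terminator */

/* Userspace mirror of the name checks described in this patch. */
bool toy_anon_name_ok(const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; name[i] != '\0'; i++) {
		unsigned char ch = (unsigned char)name[i];

		if (i >= TOY_ANON_NAME_MAX - 1)
			return false;	/* too long once the NUL is counted */
		if (ch < 0x20 || ch > 0x7e)
			return false;	/* not printable ASCII */
		if (ch == '[' || ch == '\\' || ch == ']')
			return false;
	}
	return true;
}

/* Returns 0 on success, -1 with errno set.  A NULL name clears the name. */
int toy_name_region(void *addr, size_t len, const char *name)
{
	return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)addr,
		     (unsigned long)len, (unsigned long)name);
}
```

On a kernel without this patch the prctl call is expected to fail with
EINVAL, so callers that only use names for debugging can ignore the error.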

The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas that have the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.

The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
That version used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However, during the last upstreaming
attempt, Kees Cook raised concerns [2] about this approach and suggested
copying the name into kernel memory space, performing validity checks [3]
and storing it as a string referenced from vm_area_struct.
One big concern is fork() performance, since fork() would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with a worst-case
scenario of forking a process with 64k vmas having the longest possible names
[4]. I ran this experiment on an ARM64 Android device and recorded a
worst-case regression of almost 40% when forking such a process. This
regression is addressed in the followup patch which replaces the pointer
to a name with a refcounted structure that allows sharing the name pointer
between vmas of the same name. Instead of duplicating the string during
fork() or when splitting a vma, it increments the refcount.
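
The idea of sharing one name between vmas can be sketched in userspace as
below. This is not the structure from the followup patch: struct
toy_anon_name and its helpers are hypothetical, and the plain counter is a
simplification, since a real implementation would need an atomic reference
count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout; the followup patch defines its own structure. */
struct toy_anon_name {
	unsigned int refcount;
	char name[];	/* NUL-terminated copy of the name */
};

struct toy_anon_name *toy_anon_name_alloc(const char *name)
{
	size_t len = strlen(name) + 1;
	struct toy_anon_name *an = malloc(sizeof(*an) + len);

	if (!an)
		return NULL;
	an->refcount = 1;
	memcpy(an->name, name, len);
	return an;
}

/* fork() or a vma split takes another reference instead of copying. */
struct toy_anon_name *toy_anon_name_get(struct toy_anon_name *an)
{
	if (an)
		an->refcount++;
	return an;
}

/* Drops one reference; returns 1 if the name was freed, 0 otherwise. */
int toy_anon_name_put(struct toy_anon_name *an)
{
	if (!an)
		return 0;
	assert(an->refcount > 0);
	if (--an->refcount > 0)
		return 0;
	free(an);
	return 1;
}
```

With this shape, duplicating a named vma costs one counter increment
regardless of the name length, which is the property the worst-case fork()
experiment above is sensitive to.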

[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

Changes for prctl(2) manual page (in the options section):

PR_SET_VMA
	Sets an attribute specified in arg2 for virtual memory areas
	starting from the address specified in arg3 and spanning the
	size specified in arg4. arg5 specifies the value of the attribute
	to be set. Note that assigning an attribute to a virtual memory
	area might prevent it from being merged with adjacent virtual
	memory areas due to the difference in that attribute's value.

	Currently, arg2 must be one of:

	PR_SET_VMA_ANON_NAME
		Set a name for anonymous virtual memory areas. arg5 should
		be a pointer to a null-terminated string containing the
		name. The name length including null byte cannot exceed
		256 bytes. If arg5 is NULL, the name of the appropriate
		anonymous virtual memory areas will be reset.

Signed-off-by: Colin Cross <ccross@google.com>
[surenb: rebased over v5.14-rc7, replaced user pointer with a kernel copy
and added input sanitization. The bulk of the work here was done by Colin
Cross, therefore, with his permission, keeping him as the author]
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
previous version including cover letter with test results is at:
https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/

changes in v9
- Added documentation for prctl(2) manual page describing newly introduced
options, per Pavel Machek
- Documented the downside of naming an anonymous vma which might prevent
it from being merged with adjacent vmas, per Cyrill Gorcunov
- Replaced seq_puts+seq_write with seq_printf, per Kees Cook
- Changed name validation to allow only printable ascii characters, except for
'[', '\' and ']', per Rasmus Villemoes
- Added madvise_set_anon_name definition dependency on CONFIG_PROC_FS,
per Michal Hocko
- Added NULL check for the name input in prctl_set_vma to correctly handle this
case, per Michal Hocko
- Handle the possibility of kstrdup returning NULL, per Rolf Eike Beer
- Changed max anon vma name length from 64 to 256 (as in the original patch)
because I found one case of the name length being 139 bytes. If anyone is
curious, here it is:
dalvik-/data/dalvik-cache/arm64/apex@com.android.permission@priv-app@GooglePermissionController@GooglePermissionController.apk@classes.art

 Documentation/filesystems/proc.rst |   2 +
 fs/proc/task_mmu.c                 |  12 ++-
 fs/userfaultfd.c                   |   7 +-
 include/linux/mm.h                 |  13 ++-
 include/linux/mm_types.h           |  48 ++++++++++-
 include/uapi/linux/prctl.h         |   3 +
 kernel/fork.c                      |   2 +
 kernel/sys.c                       |  61 ++++++++++++++
 mm/madvise.c                       | 131 ++++++++++++++++++++++++++---
 mm/mempolicy.c                     |   3 +-
 mm/mlock.c                         |   2 +-
 mm/mmap.c                          |  38 +++++----
 mm/mprotect.c                      |   2 +-
 13 files changed, 283 insertions(+), 41 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 042c418f4090..a067eec54ef1 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -431,6 +431,8 @@ is not associated with a file:
  [stack]                    the stack of the main process
  [vdso]                     the "virtual dynamic shared object",
                             the kernel system call handler
+[anon:<name>]               an anonymous mapping that has been
+                            named by userspace
  =======                    ====================================
 
  or if empty, the mapping is anonymous.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..d41edb4b4540 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -308,6 +308,8 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 
 	name = arch_vma_name(vma);
 	if (!name) {
+		const char *anon_name;
+
 		if (!mm) {
 			name = "[vdso]";
 			goto done;
@@ -319,8 +321,16 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 			goto done;
 		}
 
-		if (is_stack(vma))
+		if (is_stack(vma)) {
 			name = "[stack]";
+			goto done;
+		}
+
+		anon_name = vma_anon_name(vma);
+		if (anon_name) {
+			seq_pad(m, ' ');
+			seq_printf(m, "[anon:%s]", anon_name);
+		}
 	}
 
 done:
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5c2d806e6ae5..5057843fb71a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -876,7 +876,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 				 new_flags, vma->anon_vma,
 				 vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX);
+				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 		if (prev)
 			vma = prev;
 		else
@@ -1440,7 +1440,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 ((struct vm_userfaultfd_ctx){ ctx }));
+				 ((struct vm_userfaultfd_ctx){ ctx }),
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
@@ -1617,7 +1618,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX);
+				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e59646a5d44d..c72226215f33 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2550,7 +2550,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *, struct vm_userfaultfd_ctx);
+	struct mempolicy *, struct vm_userfaultfd_ctx, const char *);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
@@ -3285,5 +3285,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
 	return 0;
 }
 
+#if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_PROC_FS)
+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+			  unsigned long len_in, const char *name);
+#else
+static inline int
+madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+		      unsigned long len_in, const char *name) {
+	return 0;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09c711f..968a1d0463d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -350,11 +350,19 @@ struct vm_area_struct {
 	/*
 	 * For areas with an address space and backing store,
 	 * linkage into the address_space->i_mmap interval tree.
+	 *
+	 * For private anonymous mappings, a pointer to a null terminated string
+	 * containing the name given to the vma, or NULL if unnamed.
 	 */
-	struct {
-		struct rb_node rb;
-		unsigned long rb_subtree_last;
-	} shared;
+
+	union {
+		struct {
+			struct rb_node rb;
+			unsigned long rb_subtree_last;
+		} shared;
+		/* Serialized by mmap_sem. */
+		char *anon_name;
+	};
 
 	/*
 	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
@@ -809,4 +817,36 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/*
+ * mmap_lock should be read-locked when calling vma_anon_name() and while using
+ * the returned pointer.
+ */
+extern const char *vma_anon_name(struct vm_area_struct *vma);
+
+/*
+ * mmap_lock should be read-locked for orig_vma->vm_mm.
+ * mmap_lock should be write-locked for new_vma->vm_mm or new_vma should be
+ * isolated.
+ */
+extern void dup_vma_anon_name(struct vm_area_struct *orig_vma,
+			      struct vm_area_struct *new_vma);
+
+/*
+ * mmap_lock should be write-locked or vma should have been isolated under
+ * write-locked mmap_lock protection.
+ */
+extern void free_vma_anon_name(struct vm_area_struct *vma);
+
+/* mmap_lock should be read-locked */
+static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
+					 const char *name)
+{
+	const char *vma_name = vma_anon_name(vma);
+
+	if (likely(!vma_name))
+		return name == NULL;
+
+	return name && !strcmp(name, vma_name);
+}
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43bd7f713c39..4c8cbf510b2d 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -269,4 +269,7 @@ struct prctl_mm_map {
 # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
 # define PR_SCHED_CORE_MAX		4
 
+#define PR_SET_VMA		0x53564d41
+# define PR_SET_VMA_ANON_NAME		0
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 695d1343a254..cfb8c47564d8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -366,12 +366,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
+		dup_vma_anon_name(orig, new);
 	}
 	return new;
 }
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	free_vma_anon_name(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 72c7639e3c98..25118902a376 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
+#ifdef CONFIG_MMU
+
+#define ANON_VMA_NAME_MAX_LEN	256
+
+static inline bool is_valid_name_char(char ch)
+{
+	/* printable ascii characters, except [ \ ] */
+	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
+}
+
+static int prctl_set_vma(unsigned long opt, unsigned long addr,
+			 unsigned long size, unsigned long arg)
+{
+	struct mm_struct *mm = current->mm;
+	const char __user *uname;
+	char *name, *pch;
+	int error;
+
+	switch (opt) {
+	case PR_SET_VMA_ANON_NAME:
+		uname = (const char __user *)arg;
+		if (!uname) {
+			/* Reset the name */
+			name = NULL;
+			goto set_name;
+		}
+
+		name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
+
+		if (IS_ERR(name))
+			return PTR_ERR(name);
+
+		for (pch = name; *pch != '\0'; pch++) {
+			if (!is_valid_name_char(*pch)) {
+				kfree(name);
+				return -EINVAL;
+			}
+		}
+set_name:
+		mmap_write_lock(mm);
+		error = madvise_set_anon_name(mm, addr, size, name);
+		mmap_write_unlock(mm);
+		kfree(name);
+		break;
+	default:
+		error = -EINVAL;
+	}
+
+	return error;
+}
+#else /* CONFIG_MMU */
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+			 unsigned long size, unsigned long arg)
+{
+	return -EINVAL;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2568,6 +2626,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
 		break;
 #endif
+	case PR_SET_VMA:
+		error = prctl_set_vma(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/madvise.c b/mm/madvise.c
index 54bf9f73f95d..0c6d0f64d432 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -18,6 +18,7 @@
 #include <linux/fadvise.h>
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
+#include <linux/string.h>
 #include <linux/uio.h>
 #include <linux/ksm.h>
 #include <linux/fs.h>
@@ -62,19 +63,78 @@ static int madvise_need_mmap_write(int behavior)
 	}
 }
 
+static inline bool has_vma_anon_name(struct vm_area_struct *vma)
+{
+	return !vma->vm_file && vma->anon_name;
+}
+
+const char *vma_anon_name(struct vm_area_struct *vma)
+{
+	if (!has_vma_anon_name(vma))
+		return NULL;
+
+	mmap_assert_locked(vma->vm_mm);
+
+	return vma->anon_name;
+}
+
+void dup_vma_anon_name(struct vm_area_struct *orig_vma,
+		       struct vm_area_struct *new_vma)
+{
+	if (!has_vma_anon_name(orig_vma))
+		return;
+
+	new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
+}
+
+void free_vma_anon_name(struct vm_area_struct *vma)
+{
+	if (!has_vma_anon_name(vma))
+		return;
+
+	kfree(vma->anon_name);
+	vma->anon_name = NULL;
+}
+
+/* mmap_lock should be write-locked */
+static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
+{
+	if (!name) {
+		free_vma_anon_name(vma);
+		return 0;
+	}
+
+	if (vma->anon_name) {
+		/* Should never happen, to dup use dup_vma_anon_name() */
+		WARN_ON(vma->anon_name == name);
+
+		/* Same name, nothing to do here */
+		if (!strcmp(name, vma->anon_name))
+			return 0;
+
+		free_vma_anon_name(vma);
+	}
+	vma->anon_name = kstrdup(name, GFP_KERNEL);
+	if (!vma->anon_name)
+		return -ENOMEM;
+
+	return 0;
+}
+
 /*
- * Update the vm_flags on regiion of a vma, splitting it or merging it as
+ * Update the vm_flags on region of a vma, splitting it or merging it as
  * necessary.  Must be called with mmap_sem held for writing;
  */
 static int madvise_update_vma(struct vm_area_struct *vma,
 			      struct vm_area_struct **prev, unsigned long start,
-			      unsigned long end, unsigned long new_flags)
+			      unsigned long end, unsigned long new_flags,
+			      const char *name)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int error;
 	pgoff_t pgoff;
 
-	if (new_flags == vma->vm_flags) {
+	if (new_flags == vma->vm_flags && is_same_vma_anon_name(vma, name)) {
 		*prev = vma;
 		return 0;
 	}
@@ -82,7 +142,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, name);
 	if (*prev) {
 		vma = *prev;
 		goto success;
@@ -91,20 +151,16 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	*prev = vma;
 
 	if (start != vma->vm_start) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
-			error = -ENOMEM;
-			return error;
-		}
+		if (unlikely(mm->map_count >= sysctl_max_map_count))
+			return -ENOMEM;
 		error = __split_vma(mm, vma, start, 1);
 		if (error)
 			return error;
 	}
 
 	if (end != vma->vm_end) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
-			error = -ENOMEM;
-			return error;
-		}
+		if (unlikely(mm->map_count >= sysctl_max_map_count))
+			return -ENOMEM;
 		error = __split_vma(mm, vma, end, 0);
 		if (error)
 			return error;
@@ -115,10 +171,33 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 */
 	vma->vm_flags = new_flags;
+	if (!vma->vm_file) {
+		error = replace_vma_anon_name(vma, name);
+		if (error)
+			return error;
+	}
 
 	return 0;
 }
 
+static int madvise_vma_anon_name(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end,
+				 unsigned long name)
+{
+	int error;
+
+	/* Only anonymous mappings can be named */
+	if (vma->vm_file)
+		return -EINVAL;
+
+	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
+				   (const char *)name);
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_SWAP
 static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	unsigned long end, struct mm_walk *walk)
@@ -942,7 +1021,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	}
 
-	error = madvise_update_vma(vma, prev, start, end, new_flags);
+	error = madvise_update_vma(vma, prev, start, end, new_flags,
+				   vma_anon_name(vma));
 
 out:
 	/*
@@ -1121,6 +1201,31 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 	return unmapped_error;
 }
 
+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+			  unsigned long len_in, const char *name)
+{
+	unsigned long end;
+	unsigned long len;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+	/* Check whether a small negative len_in was rounded up to zero */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	return madvise_walk_vmas(mm, start, end, (unsigned long)name,
+				 madvise_vma_anon_name);
+}
+
 /*
  * The madvise(2) system call.
  *
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..cc21ca7e9d40 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -811,7 +811,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				 vma->anon_vma, vma->vm_file, pgoff,
-				 new_pol, vma->vm_userfaultfd_ctx);
+				 new_pol, vma->vm_userfaultfd_ctx,
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 16d2ee160d43..c878515680af 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -511,7 +511,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 181a113b545d..c13934d41f65 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1032,7 +1032,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 				struct file *file, unsigned long vm_flags,
-				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+				const char *anon_name)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -1050,6 +1051,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
 		return 0;
+	if (!is_same_vma_anon_name(vma, anon_name))
+		return 0;
 	return 1;
 }
 
@@ -1082,9 +1085,10 @@ static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
 		     struct anon_vma *anon_vma, struct file *file,
 		     pgoff_t vm_pgoff,
-		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		     const char *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -1103,9 +1107,10 @@ static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 		    struct anon_vma *anon_vma, struct file *file,
 		    pgoff_t vm_pgoff,
-		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		    const char *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1116,9 +1121,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 }
 
 /*
- * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.
- * Or both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
+ * figure out whether that can be merged with its predecessor or its
+ * successor.  Or both (it neatly fills a hole).
  *
  * In most cases - when called for mmap, brk or mremap - [addr,end) is
  * certain not to be mapped by the time vma_merge is called; but when
@@ -1163,7 +1168,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
 			pgoff_t pgoff, struct mempolicy *policy,
-			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+			const char *anon_name)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1193,7 +1199,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
 					    anon_vma, file, pgoff,
-					    vm_userfaultfd_ctx)) {
+					    vm_userfaultfd_ctx, anon_name)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
@@ -1202,7 +1208,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 				can_vma_merge_before(next, vm_flags,
 						     anon_vma, file,
 						     pgoff+pglen,
-						     vm_userfaultfd_ctx) &&
+						     vm_userfaultfd_ctx, anon_name) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1225,7 +1231,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
 					     anon_vma, file, pgoff+pglen,
-					     vm_userfaultfd_ctx)) {
+					     vm_userfaultfd_ctx, anon_name)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = __vma_adjust(prev, prev->vm_start,
 					 addr, prev->vm_pgoff, NULL, next);
@@ -1760,7 +1766,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -1819,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 */
 		if (unlikely(vm_flags != vma->vm_flags && prev)) {
 			merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
+				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 			if (merge) {
 				/* ->mmap() can change vma->vm_file and fput the original file. So
 				 * fput the vma->vm_file here or we would add an extra fput for file
@@ -3081,7 +3087,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -3274,7 +3280,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+			    vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..a48ff8e79f48 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -464,7 +464,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
 			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			   vma->vm_userfaultfd_ctx);
+			   vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*pprev) {
 		vma = *pprev;
 		VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
-- 
2.33.0.153.gba50c8fa24-goog
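
For readers who want to try the interface from userspace, here is a minimal
sketch of a caller for the PR_SET_VMA_ANON_NAME path in the patch above. It is
an illustration only and not part of the series. The helper names
(name_char_is_valid, anon_name_is_valid, set_anon_vma_name,
demo_name_one_page) are hypothetical. The fallback values for PR_SET_VMA and
PR_SET_VMA_ANON_NAME are assumed to match the uapi header this series adds.
The character check mirrors is_valid_name_char() so that an invalid name is
rejected before the syscall; the kernel-side length limit
(ANON_VMA_NAME_MAX_LEN) is not checked here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Assumed to match the uapi definitions added by this series. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Mirrors is_valid_name_char(): printable ASCII except '[', '\\' and ']'. */
static bool name_char_is_valid(unsigned char ch)
{
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/* Hypothetical helper: true if every character of name would be accepted. */
static bool anon_name_is_valid(const char *name)
{
	for (const char *p = name; *p != '\0'; p++) {
		if (!name_char_is_valid((unsigned char)*p))
			return false;
	}
	return true;
}

/*
 * Hypothetical wrapper: name (or, with name == NULL, un-name) the range
 * [addr, addr + len).  Returns 0 on success, -1 with errno set on failure.
 */
static int set_anon_vma_name(void *addr, size_t len, const char *name)
{
	if (name && !anon_name_is_valid(name)) {
		errno = EINVAL;
		return -1;
	}
	return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		     (unsigned long)addr, (unsigned long)len,
		     (unsigned long)name);
}

/* Map one anonymous page, try to name it, and unmap it again. */
static int demo_name_one_page(const char *name)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;
	int rc, saved;

	assert(page > 0);
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	rc = set_anon_vma_name(p, (size_t)page, name);
	saved = errno;		/* keep the prctl() error across munmap() */
	munmap(p, (size_t)page);
	errno = saved;
	return rc;
}
```

On a kernel without this series, prctl() is expected to fail with EINVAL
because PR_SET_VMA falls through to the default case, so callers that treat
the name as a debugging aid will probably want to ignore that error.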

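The range handling in madvise_set_anon_name() above (reject an unaligned
start, round the length up to a whole page, catch wraparound) can be tried out
in userspace with a sketch like the one below. Assumptions: a fixed 4096-byte
page (DEMO_PAGE_SIZE) instead of the kernel's PAGE_SIZE, and a hypothetical
helper name, check_anon_name_range(). The arithmetic itself follows the patch.

```c
#include <assert.h>
#include <errno.h>

/* Assumed page size for the demo; the kernel uses PAGE_SIZE/PAGE_MASK. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/*
 * Hypothetical userspace mirror of the checks in madvise_set_anon_name().
 * On success returns 0 and stores the exclusive end address in *end_out
 * (equal to start for an empty range); otherwise returns -EINVAL.
 */
static int check_anon_name_range(unsigned long start, unsigned long len_in,
				 unsigned long *end_out)
{
	unsigned long len, end;

	/* The mask trick below only works for power-of-two page sizes. */
	assert((DEMO_PAGE_SIZE & (DEMO_PAGE_SIZE - 1)) == 0);

	/* start must be page aligned */
	if (start & ~DEMO_PAGE_MASK)
		return -EINVAL;

	/* round the length up to a multiple of the page size */
	len = (len_in + ~DEMO_PAGE_MASK) & DEMO_PAGE_MASK;

	/* a huge ("small negative") len_in wraps to zero when rounded up */
	if (len_in && !len)
		return -EINVAL;

	/* start + len must not wrap around the address space */
	end = start + len;
	if (end < start)
		return -EINVAL;

	*end_out = end;
	return 0;
}
```

For example, with the assumed 4096-byte page, start = 0x1000 and len_in = 1
give end = 0x2000, while len_in = ~0UL is rejected because the rounded length
wraps to zero.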


* [PATCH v9 3/3] mm: add anonymous vma name refcounting
  2021-09-02 23:18 [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
@ 2021-09-02 23:18 ` Suren Baghdasaryan
  2021-09-03 22:20   ` Kees Cook
  2021-09-03  0:28 ` [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
  2 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-02 23:18 UTC (permalink / raw)
  To: akpm
  Cc: ccross, sumit.semwal, mhocko, dave.hansen, keescook, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team, surenb

While forking a process with a high number (64K) of named anonymous vmas,
the overhead caused by kstrdup() is noticeable. Experiments with an ARM64
Android device show up to a 40% performance regression when forking a
process with 64K unpopulated anonymous vmas using the maximum name length,
compared with the same process with the same number of unnamed anonymous
vmas.

Introduce a refcounted anon_vma_name structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member of
the anon_vma_name structure is assigned at structure allocation time and is
never changed. If a vma's name changes, the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name, and the vma pointer is updated to point to the new structure.

With this approach the fork() performance regression is reduced 3-4x, and
for use cases with a more reasonable number of vmas (a few thousand) the
regression is not measurable.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
previous version including cover letter with test results is at:
https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/

changes in v9
- Replaced kzalloc with kmalloc in anon_vma_name_alloc, per Rolf Eike Beer
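
The refcounting scheme described in the commit message above can be sketched
in plain userspace C as follows. This is an illustration under stated
assumptions, not the kernel code: struct name_ref and the name_ref_*() helpers
are hypothetical stand-ins for struct anon_vma_name and the kref calls, the
counter is a plain unsigned int (the patch uses struct kref), and
malloc()/free() stand in for kmalloc()/kfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct anon_vma_name. */
struct name_ref {
	unsigned int refcount;
	char name[];	/* flexible array member, must be last */
};

/* Allocate header and string in one block; the name is never changed. */
static struct name_ref *name_ref_alloc(const char *name)
{
	size_t len = strlen(name);
	struct name_ref *ref = malloc(sizeof(*ref) + len + 1);

	if (ref) {
		ref->refcount = 1;
		memcpy(ref->name, name, len + 1);	/* includes the NUL */
	}
	return ref;
}

/* "Duplicate" a name by taking another reference instead of copying. */
static struct name_ref *name_ref_get(struct name_ref *ref)
{
	assert(ref->refcount > 0);
	ref->refcount++;
	return ref;
}

/* Drop one reference; returns true if this freed the structure. */
static bool name_ref_put(struct name_ref *ref)
{
	assert(ref->refcount > 0);
	if (--ref->refcount == 0) {
		free(ref);
		return true;
	}
	return false;
}
```

In this sketch a fork()-style duplication costs one increment regardless of
the name length, which is the property the commit message relies on. Renaming
would be a name_ref_put() on the old structure followed by a fresh
name_ref_alloc().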

 include/linux/mm_types.h |  9 ++++++++-
 mm/madvise.c             | 43 +++++++++++++++++++++++++++++++++-------
 2 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 968a1d0463d8..7feb43daee6c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
 #include <linux/mm_types_task.h>
 
 #include <linux/auxvec.h>
+#include <linux/kref.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
@@ -310,6 +311,12 @@ struct vm_userfaultfd_ctx {
 struct vm_userfaultfd_ctx {};
 #endif /* CONFIG_USERFAULTFD */
 
+struct anon_vma_name {
+	struct kref kref;
+	/* The name needs to be at the end because it is dynamically sized. */
+	char name[];
+};
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -361,7 +368,7 @@ struct vm_area_struct {
 			unsigned long rb_subtree_last;
 		} shared;
 		/* Serialized by mmap_sem. */
-		char *anon_name;
+		struct anon_vma_name *anon_name;
 	};
 
 	/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 0c6d0f64d432..adc53edd3fe7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -63,6 +63,28 @@ static int madvise_need_mmap_write(int behavior)
 	}
 }
 
+static struct anon_vma_name *anon_vma_name_alloc(const char *name)
+{
+	struct anon_vma_name *anon_name;
+	size_t len = strlen(name);
+
+	/* Add 1 for NUL terminator at the end of the anon_name->name */
+	anon_name = kmalloc(sizeof(*anon_name) + len + 1, GFP_KERNEL);
+	if (anon_name) {
+		kref_init(&anon_name->kref);
+		strcpy(anon_name->name, name);
+	}
+
+	return anon_name;
+}
+
+static void vma_anon_name_free(struct kref *kref)
+{
+	struct anon_vma_name *anon_name =
+			container_of(kref, struct anon_vma_name, kref);
+	kfree(anon_name);
+}
+
 static inline bool has_vma_anon_name(struct vm_area_struct *vma)
 {
 	return !vma->vm_file && vma->anon_name;
@@ -75,7 +97,7 @@ const char *vma_anon_name(struct vm_area_struct *vma)
 
 	mmap_assert_locked(vma->vm_mm);
 
-	return vma->anon_name;
+	return vma->anon_name->name;
 }
 
 void dup_vma_anon_name(struct vm_area_struct *orig_vma,
@@ -84,37 +106,44 @@ void dup_vma_anon_name(struct vm_area_struct *orig_vma,
 	if (!has_vma_anon_name(orig_vma))
 		return;
 
-	new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
+	kref_get(&orig_vma->anon_name->kref);
+	new_vma->anon_name = orig_vma->anon_name;
 }
 
 void free_vma_anon_name(struct vm_area_struct *vma)
 {
+	struct anon_vma_name *anon_name;
+
 	if (!has_vma_anon_name(vma))
 		return;
 
-	kfree(vma->anon_name);
+	anon_name = vma->anon_name;
 	vma->anon_name = NULL;
+	kref_put(&anon_name->kref, vma_anon_name_free);
 }
 
 /* mmap_lock should be write-locked */
 static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
 {
+	const char *anon_name;
+
 	if (!name) {
 		free_vma_anon_name(vma);
 		return 0;
 	}
 
-	if (vma->anon_name) {
+	anon_name = vma_anon_name(vma);
+	if (anon_name) {
 		/* Should never happen, to dup use dup_vma_anon_name() */
-		WARN_ON(vma->anon_name == name);
+		WARN_ON(anon_name == name);
 
 		/* Same name, nothing to do here */
-		if (!strcmp(name, vma->anon_name))
+		if (!strcmp(name, anon_name))
 			return 0;
 
 		free_vma_anon_name(vma);
 	}
-	vma->anon_name = kstrdup(name, GFP_KERNEL);
+	vma->anon_name = anon_vma_name_alloc(name);
 	if (!vma->anon_name)
 		return -ENOMEM;
 
-- 
2.33.0.153.gba50c8fa24-goog



* Re: [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse
  2021-09-02 23:18 [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
  2021-09-02 23:18 ` [PATCH v9 3/3] mm: add anonymous vma name refcounting Suren Baghdasaryan
@ 2021-09-03  0:28 ` Suren Baghdasaryan
  2 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-03  0:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Colin Cross, Sumit Semwal, Michal Hocko, Dave Hansen, Kees Cook,
	Matthew Wilcox, Kirill A . Shutemov, Vlastimil Babka,
	Johannes Weiner, Jonathan Corbet, Al Viro, Randy Dunlap,
	Kalesh Singh, Peter Xu, rppt, Peter Zijlstra, Catalin Marinas,
	vincenzo.frascino, Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team, Pekka Enberg, Ingo Molnar,
	Oleg Nesterov, Jan Glauber, John Stultz, Rob Landley,
	Cyrill Gorcunov, Serge E. Hallyn, David Rientjes, Mel Gorman,
	Shaohua Li, Minchan Kim

On Thu, Sep 2, 2021 at 4:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> From: Colin Cross <ccross@google.com>
>
> Refactor the madvise syscall to allow for parts of it to be reused by a
> prctl syscall that affects vmas.
>
> Move the code that walks vmas in a virtual address range into a function
> that takes a function pointer as a parameter.  The only caller for now is
> sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
> the next patch will add an additional caller.
>
> Move handling all vma behaviors inside madvise_behavior, and rename it to
> madvise_vma_behavior.
>
> Move the code that updates the flags on a vma, including splitting or
> merging the vma as necessary, into a new function called
> madvise_update_vma.  The next patch will add support for updating a new
> anon_name field as well.
>
> Signed-off-by: Colin Cross <ccross@google.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Jan Glauber <jan.glauber@gmail.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Rob Landley <rob@landley.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Shaohua Li <shli@fusionio.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>   [sumits: rebased over v5.9-rc3]
> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
>   [surenb: rebased over v5.14-rc7]
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> previous version including cover letter with test results is at:
> https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/
>
> changes in v9
> - Removed unnecessary initialization of 'error' to 0 in madvise_vma_behavior,
> per Cyrill Gorcunov
> - Replaced goto's with returns in madvise_vma_behavior, per Cyrill Gorcunov
> - Recovered the comment explaining why we map ENOMEM to EAGAIN in
> madvise_vma_behavior, per Cyrill Gorcunov
>
>  mm/madvise.c | 317 +++++++++++++++++++++++++++------------------------
>  1 file changed, 170 insertions(+), 147 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56324a3dbc4e..54bf9f73f95d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -63,76 +63,20 @@ static int madvise_need_mmap_write(int behavior)
>  }
>
>  /*
> - * We can potentially split a vm area into separate
> - * areas, each area with its own behavior.
> + * Update the vm_flags on regiion of a vma, splitting it or merging it as
> + * necessary.  Must be called with mmap_sem held for writing;
>   */
> -static long madvise_behavior(struct vm_area_struct *vma,
> -                    struct vm_area_struct **prev,
> -                    unsigned long start, unsigned long end, int behavior)
> +static int madvise_update_vma(struct vm_area_struct *vma,
> +                             struct vm_area_struct **prev, unsigned long start,
> +                             unsigned long end, unsigned long new_flags)
>  {
>         struct mm_struct *mm = vma->vm_mm;
> -       int error = 0;
> +       int error;
>         pgoff_t pgoff;
> -       unsigned long new_flags = vma->vm_flags;
> -
> -       switch (behavior) {
> -       case MADV_NORMAL:
> -               new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> -               break;
> -       case MADV_SEQUENTIAL:
> -               new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> -               break;
> -       case MADV_RANDOM:
> -               new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> -               break;
> -       case MADV_DONTFORK:
> -               new_flags |= VM_DONTCOPY;
> -               break;
> -       case MADV_DOFORK:
> -               if (vma->vm_flags & VM_IO) {
> -                       error = -EINVAL;
> -                       goto out;
> -               }
> -               new_flags &= ~VM_DONTCOPY;
> -               break;
> -       case MADV_WIPEONFORK:
> -               /* MADV_WIPEONFORK is only supported on anonymous memory. */
> -               if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> -                       error = -EINVAL;
> -                       goto out;
> -               }
> -               new_flags |= VM_WIPEONFORK;
> -               break;
> -       case MADV_KEEPONFORK:
> -               new_flags &= ~VM_WIPEONFORK;
> -               break;
> -       case MADV_DONTDUMP:
> -               new_flags |= VM_DONTDUMP;
> -               break;
> -       case MADV_DODUMP:
> -               if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> -                       error = -EINVAL;
> -                       goto out;
> -               }
> -               new_flags &= ~VM_DONTDUMP;
> -               break;
> -       case MADV_MERGEABLE:
> -       case MADV_UNMERGEABLE:
> -               error = ksm_madvise(vma, start, end, behavior, &new_flags);
> -               if (error)
> -                       goto out_convert_errno;
> -               break;
> -       case MADV_HUGEPAGE:
> -       case MADV_NOHUGEPAGE:
> -               error = hugepage_madvise(vma, &new_flags, behavior);
> -               if (error)
> -                       goto out_convert_errno;
> -               break;
> -       }
>
>         if (new_flags == vma->vm_flags) {
>                 *prev = vma;
> -               goto out;
> +               return 0;
>         }
>
>         pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> @@ -149,21 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
>         if (start != vma->vm_start) {
>                 if (unlikely(mm->map_count >= sysctl_max_map_count)) {
>                         error = -ENOMEM;
> -                       goto out;
> +                       return error;

Oh, I missed this one. It should simply be:
-                       error = -ENOMEM;
-                       goto out;
+                       return -ENOMEM;


>                 }
>                 error = __split_vma(mm, vma, start, 1);
>                 if (error)
> -                       goto out_convert_errno;
> +                       return error;
>         }
>
>         if (end != vma->vm_end) {
>                 if (unlikely(mm->map_count >= sysctl_max_map_count)) {
>                         error = -ENOMEM;
> -                       goto out;
> +                       return error;

Same here.

>                 }
>                 error = __split_vma(mm, vma, end, 0);
>                 if (error)
> -                       goto out_convert_errno;
> +                       return error;
>         }
>
>  success:
> @@ -172,15 +116,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>          */
>         vma->vm_flags = new_flags;
>
> -out_convert_errno:
> -       /*
> -        * madvise() returns EAGAIN if kernel resources, such as
> -        * slab, are temporarily unavailable.
> -        */
> -       if (error == -ENOMEM)
> -               error = -EAGAIN;
> -out:
> -       return error;
> +       return 0;
>  }
>
>  #ifdef CONFIG_SWAP
> @@ -930,6 +866,94 @@ static long madvise_remove(struct vm_area_struct *vma,
>         return error;
>  }
>
> +/*
> + * Apply an madvise behavior to a region of a vma.  madvise_update_vma
> + * will handle splitting a vm area into separate areas, each area with its own
> + * behavior.
> + */
> +static int madvise_vma_behavior(struct vm_area_struct *vma,
> +                               struct vm_area_struct **prev,
> +                               unsigned long start, unsigned long end,
> +                               unsigned long behavior)
> +{
> +       int error;
> +       unsigned long new_flags = vma->vm_flags;
> +
> +       switch (behavior) {
> +       case MADV_REMOVE:
> +               return madvise_remove(vma, prev, start, end);
> +       case MADV_WILLNEED:
> +               return madvise_willneed(vma, prev, start, end);
> +       case MADV_COLD:
> +               return madvise_cold(vma, prev, start, end);
> +       case MADV_PAGEOUT:
> +               return madvise_pageout(vma, prev, start, end);
> +       case MADV_FREE:
> +       case MADV_DONTNEED:
> +               return madvise_dontneed_free(vma, prev, start, end, behavior);
> +       case MADV_POPULATE_READ:
> +       case MADV_POPULATE_WRITE:
> +               return madvise_populate(vma, prev, start, end, behavior);
> +       case MADV_NORMAL:
> +               new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> +               break;
> +       case MADV_SEQUENTIAL:
> +               new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> +               break;
> +       case MADV_RANDOM:
> +               new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> +               break;
> +       case MADV_DONTFORK:
> +               new_flags |= VM_DONTCOPY;
> +               break;
> +       case MADV_DOFORK:
> +               if (vma->vm_flags & VM_IO)
> +                       return -EINVAL;
> +               new_flags &= ~VM_DONTCOPY;
> +               break;
> +       case MADV_WIPEONFORK:
> +               /* MADV_WIPEONFORK is only supported on anonymous memory. */
> +               if (vma->vm_file || vma->vm_flags & VM_SHARED)
> +                       return -EINVAL;
> +               new_flags |= VM_WIPEONFORK;
> +               break;
> +       case MADV_KEEPONFORK:
> +               new_flags &= ~VM_WIPEONFORK;
> +               break;
> +       case MADV_DONTDUMP:
> +               new_flags |= VM_DONTDUMP;
> +               break;
> +       case MADV_DODUMP:
> +               if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL)
> +                       return -EINVAL;
> +               new_flags &= ~VM_DONTDUMP;
> +               break;
> +       case MADV_MERGEABLE:
> +       case MADV_UNMERGEABLE:
> +               error = ksm_madvise(vma, start, end, behavior, &new_flags);
> +               if (error)
> +                       goto out;
> +               break;
> +       case MADV_HUGEPAGE:
> +       case MADV_NOHUGEPAGE:
> +               error = hugepage_madvise(vma, &new_flags, behavior);
> +               if (error)
> +                       goto out;
> +               break;
> +       }
> +
> +       error = madvise_update_vma(vma, prev, start, end, new_flags);
> +
> +out:
> +       /*
> +        * madvise() returns EAGAIN if kernel resources, such as
> +        * slab, are temporarily unavailable.
> +        */
> +       if (error == -ENOMEM)
> +               error = -EAGAIN;
> +       return error;
> +}
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  /*
>   * Error injection support for memory error handling.
> @@ -978,30 +1002,6 @@ static int madvise_inject_error(int behavior,
>  }
>  #endif
>
> -static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -               unsigned long start, unsigned long end, int behavior)
> -{
> -       switch (behavior) {
> -       case MADV_REMOVE:
> -               return madvise_remove(vma, prev, start, end);
> -       case MADV_WILLNEED:
> -               return madvise_willneed(vma, prev, start, end);
> -       case MADV_COLD:
> -               return madvise_cold(vma, prev, start, end);
> -       case MADV_PAGEOUT:
> -               return madvise_pageout(vma, prev, start, end);
> -       case MADV_FREE:
> -       case MADV_DONTNEED:
> -               return madvise_dontneed_free(vma, prev, start, end, behavior);
> -       case MADV_POPULATE_READ:
> -       case MADV_POPULATE_WRITE:
> -               return madvise_populate(vma, prev, start, end, behavior);
> -       default:
> -               return madvise_behavior(vma, prev, start, end, behavior);
> -       }
> -}
> -
>  static bool
>  madvise_behavior_valid(int behavior)
>  {
> @@ -1054,6 +1054,73 @@ process_madvise_behavior_valid(int behavior)
>         }
>  }
>
> +/*
> + * Walk the vmas in range [start,end), and call the visit function on each one.
> + * The visit function will get start and end parameters that cover the overlap
> + * between the current vma and the original range.  Any unmapped regions in the
> + * original range will result in this function returning -ENOMEM while still
> + * calling the visit function on all of the existing vmas in the range.
> + * Must be called with the mmap_lock held for reading or writing.
> + */
> +static
> +int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> +                     unsigned long end, unsigned long arg,
> +                     int (*visit)(struct vm_area_struct *vma,
> +                                  struct vm_area_struct **prev, unsigned long start,
> +                                  unsigned long end, unsigned long arg))
> +{
> +       struct vm_area_struct *vma;
> +       struct vm_area_struct *prev;
> +       unsigned long tmp;
> +       int unmapped_error = 0;
> +
> +       /*
> +        * If the interval [start,end) covers some unmapped address
> +        * ranges, just ignore them, but return -ENOMEM at the end.
> +        * - different from the way of handling in mlock etc.
> +        */
> +       vma = find_vma_prev(mm, start, &prev);
> +       if (vma && start > vma->vm_start)
> +               prev = vma;
> +
> +       for (;;) {
> +               int error;
> +
> +               /* Still start < end. */
> +               if (!vma)
> +                       return -ENOMEM;
> +
> +               /* Here start < (end|vma->vm_end). */
> +               if (start < vma->vm_start) {
> +                       unmapped_error = -ENOMEM;
> +                       start = vma->vm_start;
> +                       if (start >= end)
> +                               break;
> +               }
> +
> +               /* Here vma->vm_start <= start < (end|vma->vm_end) */
> +               tmp = vma->vm_end;
> +               if (end < tmp)
> +                       tmp = end;
> +
> +               /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +               error = visit(vma, &prev, start, tmp, arg);
> +               if (error)
> +                       return error;
> +               start = tmp;
> +               if (prev && start < prev->vm_end)
> +                       start = prev->vm_end;
> +               if (start >= end)
> +                       break;
> +               if (prev)
> +                       vma = prev->vm_next;
> +               else    /* madvise_remove dropped mmap_lock */
> +                       vma = find_vma(mm, start);
> +       }
> +
> +       return unmapped_error;
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1126,9 +1193,7 @@ process_madvise_behavior_valid(int behavior)
>   */
>  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
>  {
> -       unsigned long end, tmp;
> -       struct vm_area_struct *vma, *prev;
> -       int unmapped_error = 0;
> +       unsigned long end;
>         int error = -EINVAL;
>         int write;
>         size_t len;
> @@ -1168,51 +1233,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>                 mmap_read_lock(mm);
>         }
>
> -       /*
> -        * If the interval [start,end) covers some unmapped address
> -        * ranges, just ignore them, but return -ENOMEM at the end.
> -        * - different from the way of handling in mlock etc.
> -        */
> -       vma = find_vma_prev(mm, start, &prev);
> -       if (vma && start > vma->vm_start)
> -               prev = vma;
> -
>         blk_start_plug(&plug);
> -       for (;;) {
> -               /* Still start < end. */
> -               error = -ENOMEM;
> -               if (!vma)
> -                       goto out;
> -
> -               /* Here start < (end|vma->vm_end). */
> -               if (start < vma->vm_start) {
> -                       unmapped_error = -ENOMEM;
> -                       start = vma->vm_start;
> -                       if (start >= end)
> -                               goto out;
> -               }
> -
> -               /* Here vma->vm_start <= start < (end|vma->vm_end) */
> -               tmp = vma->vm_end;
> -               if (end < tmp)
> -                       tmp = end;
> -
> -               /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -               error = madvise_vma(vma, &prev, start, tmp, behavior);
> -               if (error)
> -                       goto out;
> -               start = tmp;
> -               if (prev && start < prev->vm_end)
> -                       start = prev->vm_end;
> -               error = unmapped_error;
> -               if (start >= end)
> -                       goto out;
> -               if (prev)
> -                       vma = prev->vm_next;
> -               else    /* madvise_remove dropped mmap_lock */
> -                       vma = find_vma(mm, start);
> -       }
> -out:
> +       error = madvise_walk_vmas(mm, start, end, behavior,
> +                       madvise_vma_behavior);
>         blk_finish_plug(&plug);
>         if (write)
>                 mmap_write_unlock(mm);
> --
> 2.33.0.153.gba50c8fa24-goog
>
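
For anyone following the refactor, the range walk that madvise_walk_vmas()
performs can be modelled in userspace. This is only a sketch under stated
assumptions: a sorted array of non-overlapping ranges stands in for the vma
list, and the names struct range, walk_ranges() and sum_visit() are
hypothetical. It leaves out the prev bookkeeping and all locking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: a half-open range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

typedef int (*visit_fn)(const struct range *r, unsigned long start,
			unsigned long end, void *arg);

/*
 * Visit the overlap between [start, end) and each range in a sorted,
 * non-overlapping array. Holes are skipped but remembered: the walk
 * returns -ENOMEM at the end if any part of [start, end) was unmapped.
 * A visitor error aborts the walk immediately.
 */
static int walk_ranges(const struct range *r, size_t n, unsigned long start,
		       unsigned long end, visit_fn visit, void *arg)
{
	int unmapped_error = 0;
	size_t i = 0;

	assert(visit != NULL);

	/* Skip ranges that end at or before start (like find_vma()). */
	while (i < n && r[i].end <= start)
		i++;

	while (start < end) {
		unsigned long tmp;
		int error;

		if (i == n)
			return -ENOMEM;	/* ran out of ranges before end */

		if (start < r[i].start) {
			unmapped_error = -ENOMEM;	/* hole before this range */
			start = r[i].start;
			if (start >= end)
				break;
		}

		/* Clamp the visited span to the requested end. */
		tmp = r[i].end < end ? r[i].end : end;
		error = visit(&r[i], start, tmp, arg);
		if (error)
			return error;

		start = tmp;
		i++;
	}

	return unmapped_error;
}

/* Example visitor: add up the bytes it was asked to cover. */
static int sum_visit(const struct range *r, unsigned long start,
		     unsigned long end, void *arg)
{
	(void)r;
	*(unsigned long *)arg += end - start;
	return 0;
}
```

As in the patch, a hole does not stop the walk: the mapped parts of the
range are still visited and the caller sees -ENOMEM only after the walk
completes.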

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
@ 2021-09-03 21:35   ` Kees Cook
  2021-09-03 21:51     ` Suren Baghdasaryan
  2021-09-05 13:04     ` Pavel Machek
  2021-09-03 21:47   ` Kees Cook
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Kees Cook @ 2021-09-03 21:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, ccross, sumit.semwal, mhocko, dave.hansen, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team

On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> From: Colin Cross <ccross@google.com>
> 
> In many userspace applications, and especially in the VM-based applications
> that Android uses heavily, there are multiple different allocators in use.
> At a minimum there is libc malloc and the stack, and in many cases there
> are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> multiple VM heaps (one for small objects, one for big objects, etc.).
> Each of these layers usually has its own tools to inspect its usage;
> malloc by compiling a debug version, the VM through heap inspection tools,
> and for direct syscalls there is usually no way to track them.
> 
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
> 
> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> userspace-provided name for anonymous vmas.  The names of named anonymous
> vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> 
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> Setting the name to NULL clears it. The name is limited to 256 bytes,
> including the NUL terminator, and may contain only printable ASCII
> characters (including space), except '[', '\' and ']'.
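
For reference, the userspace side of the interface described above might
look like the sketch below. The PR_SET_VMA values are taken from the patch
and defined locally in case the installed headers lack them; the helper
names name_char_ok(), anon_name_ok() and set_anon_name() are hypothetical.
The pre-check is an assumption about what a caller may want, since the
kernel performs the same validation itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Values from the patch; defined here in case the libc headers lack them. */
#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME	0
#endif

#define ANON_VMA_NAME_MAX_LEN	256	/* includes the NUL terminator */

/* Printable ASCII (0x20..0x7e) except '[', '\\' and ']'. */
static bool name_char_ok(char ch)
{
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/* Hypothetical userspace pre-check mirroring the kernel-side rules. */
static bool anon_name_ok(const char *name)
{
	size_t len, i;

	assert(name != NULL);

	len = strlen(name);
	if (len + 1 > ANON_VMA_NAME_MAX_LEN)
		return false;
	for (i = 0; i < len; i++)
		if (!name_char_ok(name[i]))
			return false;
	return true;
}

/*
 * Name (or, with name == NULL, un-name) the anonymous mappings in
 * [addr, addr + len). Returns 0 or -errno. Kernels without this patch
 * are expected to fail the prctl() with EINVAL.
 */
static int set_anon_name(void *addr, size_t len, const char *name)
{
	if (name && !anon_name_ok(name))
		return -EINVAL;
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)addr,
		  (unsigned long)len, (unsigned long)name) < 0)
		return -errno;
	return 0;
}
```

On a kernel carrying this series, calling set_anon_name(p, len, "myheap")
on a private anonymous mapping should make the region show up as
[anon:myheap] in /proc/self/maps.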

Is the reason this isn't done via madvise() because we're forced into an
"int" argument there? (Otherwise we could pass a pointer there.)

	int madvise(void *addr, size_t length, int advice);

> 
> The name is stored in a pointer in the shared union in vm_area_struct
> that points to a null terminated string. Anonymous vmas that have the same
> name (equivalent strings) and are otherwise mergeable will be merged.
> The name pointers are not shared between vmas even if they contain the
> same name. The name pointer is stored in a union with fields that are
> only used on file-backed mappings, so it does not increase memory usage.
> 
> The patch is based on the original patch developed by Colin Cross, more
> specifically on its latest version [1] posted upstream by Sumit Semwal.
> It used a userspace pointer to store vma names. In that design, name
> pointers could be shared between vmas. However during the last upstreaming
> attempt, Kees Cook raised concerns [2] about this approach and suggested
> to copy the name into kernel memory space, perform validity checks [3]
> and store as a string referenced from vm_area_struct.
> One big concern is fork() performance, because fork() would need to strdup
> anonymous vma names. Dave Hansen suggested experimenting with worst-case
> scenario of forking a process with 64k vmas having longest possible names
> [4]. I ran this experiment on an ARM64 Android device and recorded a
> worst-case regression of almost 40% when forking such a process. This
> regression is addressed in the followup patch which replaces the pointer
> to a name with a refcounted structure that allows sharing the name pointer
> between vmas of the same name. Instead of duplicating the string during
> fork() or when splitting a vma it increments the refcount.
> 
> [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
> [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
> 
> Changes for prctl(2) manual page (in the options section):
> 
> PR_SET_VMA
> 	Sets an attribute specified in arg2 for virtual memory areas
> 	starting from the address specified in arg3 and spanning the
> 	size specified in arg4. arg5 specifies the value of the attribute
> 	to be set. Note that assigning an attribute to a virtual memory
> 	area might prevent it from being merged with adjacent virtual
> 	memory areas due to the difference in that attribute's value.
> 
> 	Currently, arg2 must be one of:
> 
> 	PR_SET_VMA_ANON_NAME
> 		Set a name for anonymous virtual memory areas. arg5 should
> 		be a pointer to a null-terminated string containing the
> 		name. The name length including null byte cannot exceed
> 		256 bytes. If arg5 is NULL, the name of the appropriate
> 		anonymous virtual memory areas will be reset.
> 
> Signed-off-by: Colin Cross <ccross@google.com>
> [surenb: rebased over v5.14-rc7, replaced userpointer with a kernel copy
> and added input sanitization. The bulk of the work here was done by Colin
> Cross, therefore, with his permission, keeping him as the author]
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> previous version including cover letter with test results is at:
> https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/
> 
> changes in v9
> - Added documentation for prctl(2) manual page describing newly introduced
> options, per Pavel Machek
> - Documented the downside of naming an anonymous vma which might prevent
> it from being merged with adjacent vmas, per Cyrill Gorcunov
> - Replaced seq_puts+seq_write with seq_printf, per Kees Cook
> - Changed name validation to allow only printable ascii characters, except for
> '[', '\' and ']', per Rasmus Villemoes
> - Added madvise_set_anon_name definition dependency on CONFIG_PROC_FS,
> per Michal Hocko
> - Added NULL check for the name input in prctl_set_vma to correctly handle this
> case, per Michal Hocko
> - Handle the possibility of kstrdup returning NULL, per Rolf Eike Beer
> - Changed max anon vma name length from 64 to 256 (as in the original patch)
> because I found one case of the name length being 139 bytes. If anyone is
> curious, here it is:
> dalvik-/data/dalvik-cache/arm64/apex@com.android.permission@priv-app@GooglePermissionController@GooglePermissionController.apk@classes.art
> 
>  Documentation/filesystems/proc.rst |   2 +
>  fs/proc/task_mmu.c                 |  12 ++-
>  fs/userfaultfd.c                   |   7 +-
>  include/linux/mm.h                 |  13 ++-
>  include/linux/mm_types.h           |  48 ++++++++++-
>  include/uapi/linux/prctl.h         |   3 +
>  kernel/fork.c                      |   2 +
>  kernel/sys.c                       |  61 ++++++++++++++
>  mm/madvise.c                       | 131 ++++++++++++++++++++++++++---
>  mm/mempolicy.c                     |   3 +-
>  mm/mlock.c                         |   2 +-
>  mm/mmap.c                          |  38 +++++----
>  mm/mprotect.c                      |   2 +-
>  13 files changed, 283 insertions(+), 41 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 042c418f4090..a067eec54ef1 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -431,6 +431,8 @@ is not associated with a file:
>   [stack]                    the stack of the main process
>   [vdso]                     the "virtual dynamic shared object",
>                              the kernel system call handler
> +[anon:<name>]               an anonymous mapping that has been
> +                            named by userspace
>   =======                    ====================================
>  
>   or if empty, the mapping is anonymous.
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..d41edb4b4540 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -308,6 +308,8 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
>  
>  	name = arch_vma_name(vma);
>  	if (!name) {
> +		const char *anon_name;
> +
>  		if (!mm) {
>  			name = "[vdso]";
>  			goto done;
> @@ -319,8 +321,16 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
>  			goto done;
>  		}
>  
> -		if (is_stack(vma))
> +		if (is_stack(vma)) {
>  			name = "[stack]";
> +			goto done;
> +		}
> +
> +		anon_name = vma_anon_name(vma);
> +		if (anon_name) {
> +			seq_pad(m, ' ');
> +			seq_printf(m, "[anon:%s]", anon_name);
> +		}
>  	}
>  
>  done:
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 5c2d806e6ae5..5057843fb71a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -876,7 +876,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>  				 new_flags, vma->anon_vma,
>  				 vma->vm_file, vma->vm_pgoff,
>  				 vma_policy(vma),
> -				 NULL_VM_UFFD_CTX);
> +				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
>  		if (prev)
>  			vma = prev;
>  		else
> @@ -1440,7 +1440,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  		prev = vma_merge(mm, prev, start, vma_end, new_flags,
>  				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
>  				 vma_policy(vma),
> -				 ((struct vm_userfaultfd_ctx){ ctx }));
> +				 ((struct vm_userfaultfd_ctx){ ctx }),
> +				 vma_anon_name(vma));
>  		if (prev) {
>  			vma = prev;
>  			goto next;
> @@ -1617,7 +1618,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  		prev = vma_merge(mm, prev, start, vma_end, new_flags,
>  				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
>  				 vma_policy(vma),
> -				 NULL_VM_UFFD_CTX);
> +				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
>  		if (prev) {
>  			vma = prev;
>  			goto next;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e59646a5d44d..c72226215f33 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2550,7 +2550,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  extern struct vm_area_struct *vma_merge(struct mm_struct *,
>  	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>  	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> -	struct mempolicy *, struct vm_userfaultfd_ctx);
> +	struct mempolicy *, struct vm_userfaultfd_ctx, const char *);
>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>  	unsigned long addr, int new_below);
> @@ -3285,5 +3285,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +#if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_PROC_FS)
> +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> +			  unsigned long len_in, const char *name);
> +#else
> +static inline int
> +madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> +		      unsigned long len_in, const char *name) {
> +	return 0;
> +}
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7f8ee09c711f..968a1d0463d8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -350,11 +350,19 @@ struct vm_area_struct {
>  	/*
>  	 * For areas with an address space and backing store,
>  	 * linkage into the address_space->i_mmap interval tree.
> +	 *
> +	 * For private anonymous mappings, a pointer to a null terminated string
> +	 * containing the name given to the vma, or NULL if unnamed.
>  	 */
> -	struct {
> -		struct rb_node rb;
> -		unsigned long rb_subtree_last;
> -	} shared;
> +
> +	union {
> +		struct {
> +			struct rb_node rb;
> +			unsigned long rb_subtree_last;
> +		} shared;
> +		/* Serialized by mmap_sem. */
> +		char *anon_name;
> +	};
>  
>  	/*
>  	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
> @@ -809,4 +817,36 @@ typedef struct {
>  	unsigned long val;
>  } swp_entry_t;
>  
> +/*
> + * mmap_lock should be read-locked when calling vma_anon_name() and while using
> + * the returned pointer.
> + */
> +extern const char *vma_anon_name(struct vm_area_struct *vma);
> +
> +/*
> + * mmap_lock should be read-locked for orig_vma->vm_mm.
> + * mmap_lock should be write-locked for new_vma->vm_mm or new_vma should be
> + * isolated.
> + */
> +extern void dup_vma_anon_name(struct vm_area_struct *orig_vma,
> +			      struct vm_area_struct *new_vma);
> +
> +/*
> + * mmap_lock should be write-locked or vma should have been isolated under
> + * write-locked mmap_lock protection.
> + */
> +extern void free_vma_anon_name(struct vm_area_struct *vma);
> +
> +/* mmap_lock should be read-locked */
> +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> +					 const char *name)
> +{
> +	const char *vma_name = vma_anon_name(vma);
> +
> +	if (likely(!vma_name))
> +		return name == NULL;
> +
> +	return name && !strcmp(name, vma_name);
> +}
> +
>  #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43bd7f713c39..4c8cbf510b2d 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,7 @@ struct prctl_mm_map {
>  # define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
>  # define PR_SCHED_CORE_MAX		4
>  
> +#define PR_SET_VMA		0x53564d41
> +# define PR_SET_VMA_ANON_NAME		0
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 695d1343a254..cfb8c47564d8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -366,12 +366,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  		*new = data_race(*orig);
>  		INIT_LIST_HEAD(&new->anon_vma_chain);
>  		new->vm_next = new->vm_prev = NULL;
> +		dup_vma_anon_name(orig, new);
>  	}
>  	return new;
>  }
>  
>  void vm_area_free(struct vm_area_struct *vma)
>  {
> +	free_vma_anon_name(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 72c7639e3c98..25118902a376 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>  
>  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
>  
> +#ifdef CONFIG_MMU
> +
> +#define ANON_VMA_NAME_MAX_LEN	256
> +
> +static inline bool is_valid_name_char(char ch)
> +{
> +	/* printable ascii characters, except [ \ ] */
> +	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> +}
> +
> +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> +			 unsigned long size, unsigned long arg)
> +{
> +	struct mm_struct *mm = current->mm;
> +	const char __user *uname;
> +	char *name, *pch;
> +	int error;
> +
> +	switch (opt) {
> +	case PR_SET_VMA_ANON_NAME:
> +		uname = (const char __user *)arg;
> +		if (!uname) {
> +			/* Reset the name */
> +			name = NULL;
> +			goto set_name;
> +		}
> +
> +		name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
> +
> +		if (IS_ERR(name))
> +			return PTR_ERR(name);
> +
> +		for (pch = name; *pch != '\0'; pch++) {
> +			if (!is_valid_name_char(*pch)) {
> +				kfree(name);
> +				return -EINVAL;
> +			}
> +		}
> +set_name:
> +		mmap_write_lock(mm);
> +		error = madvise_set_anon_name(mm, addr, size, name);
> +		mmap_write_unlock(mm);
> +		kfree(name);
> +		break;
> +	default:
> +		error = -EINVAL;
> +	}
> +
> +	return error;
> +}
> +#else /* CONFIG_MMU */
> +static int prctl_set_vma(unsigned long opt, unsigned long start,
> +			 unsigned long size, unsigned long arg)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2568,6 +2626,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
>  		break;
>  #endif
> +	case PR_SET_VMA:
> +		error = prctl_set_vma(arg2, arg3, arg4, arg5);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 54bf9f73f95d..0c6d0f64d432 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -18,6 +18,7 @@
>  #include <linux/fadvise.h>
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
> +#include <linux/string.h>
>  #include <linux/uio.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
> @@ -62,19 +63,78 @@ static int madvise_need_mmap_write(int behavior)
>  	}
>  }
>  
> +static inline bool has_vma_anon_name(struct vm_area_struct *vma)
> +{
> +	return !vma->vm_file && vma->anon_name;
> +}
> +
> +const char *vma_anon_name(struct vm_area_struct *vma)
> +{
> +	if (!has_vma_anon_name(vma))
> +		return NULL;
> +
> +	mmap_assert_locked(vma->vm_mm);
> +
> +	return vma->anon_name;
> +}
> +
> +void dup_vma_anon_name(struct vm_area_struct *orig_vma,
> +		       struct vm_area_struct *new_vma)
> +{
> +	if (!has_vma_anon_name(orig_vma))
> +		return;
> +
> +	new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
> +}
> +
> +void free_vma_anon_name(struct vm_area_struct *vma)
> +{
> +	if (!has_vma_anon_name(vma))
> +		return;
> +
> +	kfree(vma->anon_name);
> +	vma->anon_name = NULL;
> +}
> +
> +/* mmap_lock should be write-locked */
> +static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> +{
> +	if (!name) {
> +		free_vma_anon_name(vma);
> +		return 0;
> +	}
> +
> +	if (vma->anon_name) {
> +		/* Should never happen, to dup use dup_vma_anon_name() */
> +		WARN_ON(vma->anon_name == name);
> +
> +		/* Same name, nothing to do here */
> +		if (!strcmp(name, vma->anon_name))
> +			return 0;
> +
> +		free_vma_anon_name(vma);
> +	}
> +	vma->anon_name = kstrdup(name, GFP_KERNEL);
> +	if (!vma->anon_name)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
>  /*
> - * Update the vm_flags on regiion of a vma, splitting it or merging it as
> + * Update the vm_flags on region of a vma, splitting it or merging it as
>   * necessary.  Must be called with mmap_sem held for writing;
>   */
>  static int madvise_update_vma(struct vm_area_struct *vma,
>  			      struct vm_area_struct **prev, unsigned long start,
> -			      unsigned long end, unsigned long new_flags)
> +			      unsigned long end, unsigned long new_flags,
> +			      const char *name)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	int error;
>  	pgoff_t pgoff;
>  
> -	if (new_flags == vma->vm_flags) {
> +	if (new_flags == vma->vm_flags && is_same_vma_anon_name(vma, name)) {
>  		*prev = vma;
>  		return 0;
>  	}
> @@ -82,7 +142,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
>  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>  	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
>  			  vma->vm_file, pgoff, vma_policy(vma),
> -			  vma->vm_userfaultfd_ctx);
> +			  vma->vm_userfaultfd_ctx, name);
>  	if (*prev) {
>  		vma = *prev;
>  		goto success;
> @@ -91,20 +151,16 @@ static int madvise_update_vma(struct vm_area_struct *vma,
>  	*prev = vma;
>  
>  	if (start != vma->vm_start) {
> -		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> -			error = -ENOMEM;
> -			return error;
> -		}
> +		if (unlikely(mm->map_count >= sysctl_max_map_count))
> +			return -ENOMEM;
>  		error = __split_vma(mm, vma, start, 1);
>  		if (error)
>  			return error;
>  	}
>  
>  	if (end != vma->vm_end) {
> -		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> -			error = -ENOMEM;
> -			return error;
> -		}
> +		if (unlikely(mm->map_count >= sysctl_max_map_count))
> +			return -ENOMEM;
>  		error = __split_vma(mm, vma, end, 0);
>  		if (error)
>  			return error;
> @@ -115,10 +171,33 @@ static int madvise_update_vma(struct vm_area_struct *vma,
>  	 * vm_flags is protected by the mmap_lock held in write mode.
>  	 */
>  	vma->vm_flags = new_flags;
> +	if (!vma->vm_file) {
> +		error = replace_vma_anon_name(vma, name);
> +		if (error)
> +			return error;
> +	}
>  
>  	return 0;
>  }
>  
> +static int madvise_vma_anon_name(struct vm_area_struct *vma,
> +				 struct vm_area_struct **prev,
> +				 unsigned long start, unsigned long end,
> +				 unsigned long name)
> +{
> +	int error;
> +
> +	/* Only anonymous mappings can be named */
> +	if (vma->vm_file)
> +		return -EINVAL;

To distinguish from the other EINVALs, should this maybe be EBADF? (As
in "no, you can't do that, there is already an fd associated".)

> +
> +	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
> +				   (const char *)name);
> +	if (error == -ENOMEM)
> +		error = -EAGAIN;

I think a comment would be useful here too. AIUI, this is done to match
the error behavior seen under madvise_vma_behavior().
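
To make the two suggestions above concrete, here is a userspace model of
the resulting error mapping. It is a sketch only: anon_name_result() is a
hypothetical helper, and EBADF is just the proposal under discussion, not
what the patch currently returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the error path in madvise_vma_anon_name():
 * file_backed stands in for vma->vm_file being set, update_error for
 * the return value of madvise_update_vma().
 */
static int anon_name_result(bool file_backed, int update_error)
{
	/* Only anonymous mappings can be named (EBADF is the proposal). */
	if (file_backed)
		return -EBADF;

	/*
	 * Match madvise_vma_behavior(): report EAGAIN if kernel
	 * resources, such as slab, are temporarily unavailable.
	 */
	if (update_error == -ENOMEM)
		return -EAGAIN;

	return update_error;
}
```

With this mapping a caller can tell "wrong kind of mapping" apart from the
other EINVAL cases, and a transient allocation failure reads the same way
it does for madvise().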

> +	return error;
> +}
> +
>  #ifdef CONFIG_SWAP
>  static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
>  	unsigned long end, struct mm_walk *walk)
> @@ -942,7 +1021,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		break;
>  	}
>  
> -	error = madvise_update_vma(vma, prev, start, end, new_flags);
> +	error = madvise_update_vma(vma, prev, start, end, new_flags,
> +				   vma_anon_name(vma));
>  
>  out:
>  	/*
> @@ -1121,6 +1201,31 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
>  	return unmapped_error;
>  }
>  
> +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> +			  unsigned long len_in, const char *name)
> +{
> +	unsigned long end;
> +	unsigned long len;
> +
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +
> +	/* Check to see whether len was rounded up from small -ve to zero */
> +	if (len_in && !len)
> +		return -EINVAL;
> +
> +	end = start + len;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	if (end == start)
> +		return 0;
> +
> +	return madvise_walk_vmas(mm, start, end, (unsigned long)name,
> +				 madvise_vma_anon_name);
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index e32360e90274..cc21ca7e9d40 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -811,7 +811,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>  			((vmstart - vma->vm_start) >> PAGE_SHIFT);
>  		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
>  				 vma->anon_vma, vma->vm_file, pgoff,
> -				 new_pol, vma->vm_userfaultfd_ctx);
> +				 new_pol, vma->vm_userfaultfd_ctx,
> +				 vma_anon_name(vma));
>  		if (prev) {
>  			vma = prev;
>  			next = vma->vm_next;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 16d2ee160d43..c878515680af 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -511,7 +511,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>  	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
>  			  vma->vm_file, pgoff, vma_policy(vma),
> -			  vma->vm_userfaultfd_ctx);
> +			  vma->vm_userfaultfd_ctx, vma_anon_name(vma));
>  	if (*prev) {
>  		vma = *prev;
>  		goto success;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 181a113b545d..c13934d41f65 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1032,7 +1032,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>   */
>  static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  				struct file *file, unsigned long vm_flags,
> -				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +				struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +				const char *anon_name)
>  {
>  	/*
>  	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
> @@ -1050,6 +1051,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  		return 0;
>  	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
>  		return 0;
> +	if (!is_same_vma_anon_name(vma, anon_name))
> +		return 0;
>  	return 1;
>  }
>  
> @@ -1082,9 +1085,10 @@ static int
>  can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
>  		     struct anon_vma *anon_vma, struct file *file,
>  		     pgoff_t vm_pgoff,
> -		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +		     const char *anon_name)
>  {
> -	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
> +	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
>  	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
>  		if (vma->vm_pgoff == vm_pgoff)
>  			return 1;
> @@ -1103,9 +1107,10 @@ static int
>  can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
>  		    struct anon_vma *anon_vma, struct file *file,
>  		    pgoff_t vm_pgoff,
> -		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +		     const char *anon_name)

Nit: one too many spaces before "const".

>  {
> -	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
> +	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
>  	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
>  		pgoff_t vm_pglen;
>  		vm_pglen = vma_pages(vma);
> @@ -1116,9 +1121,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
>  }
>  
>  /*
> - * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
> - * whether that can be merged with its predecessor or its successor.
> - * Or both (it neatly fills a hole).
> + * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
> + * figure out whether that can be merged with its predecessor or its
> + * successor.  Or both (it neatly fills a hole).
>   *
>   * In most cases - when called for mmap, brk or mremap - [addr,end) is
>   * certain not to be mapped by the time vma_merge is called; but when
> @@ -1163,7 +1168,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  			unsigned long end, unsigned long vm_flags,
>  			struct anon_vma *anon_vma, struct file *file,
>  			pgoff_t pgoff, struct mempolicy *policy,
> -			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +			const char *anon_name)
>  {
>  	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>  	struct vm_area_struct *area, *next;
> @@ -1193,7 +1199,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  			mpol_equal(vma_policy(prev), policy) &&
>  			can_vma_merge_after(prev, vm_flags,
>  					    anon_vma, file, pgoff,
> -					    vm_userfaultfd_ctx)) {
> +					    vm_userfaultfd_ctx, anon_name)) {
>  		/*
>  		 * OK, it can.  Can we now merge in the successor as well?
>  		 */
> @@ -1202,7 +1208,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  				can_vma_merge_before(next, vm_flags,
>  						     anon_vma, file,
>  						     pgoff+pglen,
> -						     vm_userfaultfd_ctx) &&
> +						     vm_userfaultfd_ctx, anon_name) &&
>  				is_mergeable_anon_vma(prev->anon_vma,
>  						      next->anon_vma, NULL)) {
>  							/* cases 1, 6 */
> @@ -1225,7 +1231,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  			mpol_equal(policy, vma_policy(next)) &&
>  			can_vma_merge_before(next, vm_flags,
>  					     anon_vma, file, pgoff+pglen,
> -					     vm_userfaultfd_ctx)) {
> +					     vm_userfaultfd_ctx, anon_name)) {
>  		if (prev && addr < prev->vm_end)	/* case 4 */
>  			err = __vma_adjust(prev, prev->vm_start,
>  					 addr, prev->vm_pgoff, NULL, next);
> @@ -1760,7 +1766,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 * Can we just expand an old mapping?
>  	 */
>  	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> -			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
> +			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
>  	if (vma)
>  		goto out;
>  
> @@ -1819,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		 */
>  		if (unlikely(vm_flags != vma->vm_flags && prev)) {
>  			merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
> -				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
> +				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
>  			if (merge) {
>  				/* ->mmap() can change vma->vm_file and fput the original file. So
>  				 * fput the vma->vm_file here or we would add an extra fput for file
> @@ -3081,7 +3087,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
>  
>  	/* Can we just expand an old private anonymous mapping? */
>  	vma = vma_merge(mm, prev, addr, addr + len, flags,
> -			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
> +			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
>  	if (vma)
>  		goto out;
>  
> @@ -3274,7 +3280,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		return NULL;	/* should never get here */
>  	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>  			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> -			    vma->vm_userfaultfd_ctx);
> +			    vma->vm_userfaultfd_ctx, vma_anon_name(vma));
>  	if (new_vma) {
>  		/*
>  		 * Source vma may have been merged into new_vma
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 883e2cc85cad..a48ff8e79f48 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -464,7 +464,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>  	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>  	*pprev = vma_merge(mm, *pprev, start, end, newflags,
>  			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> -			   vma->vm_userfaultfd_ctx);
> +			   vma->vm_userfaultfd_ctx, vma_anon_name(vma));
>  	if (*pprev) {
>  		vma = *pprev;
>  		VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
> -- 
> 2.33.0.153.gba50c8fa24-goog
> 

Cool. With my notes above addressed, please consider this:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook


* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
  2021-09-03 21:35   ` Kees Cook
@ 2021-09-03 21:47   ` Kees Cook
  2021-09-03 21:56     ` Suren Baghdasaryan
  2021-09-06 16:55   ` Matthew Wilcox
  2021-10-01  7:01   ` Rasmus Villemoes
  3 siblings, 1 reply; 20+ messages in thread
From: Kees Cook @ 2021-09-03 21:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, ccross, sumit.semwal, mhocko, dave.hansen, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team

(Sorry, a few more things jumped out at me when I looked again...)

On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> [...]
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 72c7639e3c98..25118902a376 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>  
>  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
>  
> +#ifdef CONFIG_MMU
> +
> +#define ANON_VMA_NAME_MAX_LEN	256
> +
> +static inline bool is_valid_name_char(char ch)
> +{
> +	/* printable ascii characters, except [ \ ] */
> +	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> +}

In the back of my mind, I feel like disallowing backtick would be nice,
but if $, (, and ) are still allowed it doesn't matter, and disallowing
those too seems too limiting. :)
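
For reference, the range check can be tried outside the kernel. The
following is only a userspace sketch: name_char_ok() copies the
expression from the quoted is_valid_name_char(), while name_ok() is a
hypothetical helper that is not part of the patch. It shows that
backtick, $, ( and ) all pass, and that only [, \ and ] are carved out of
the printable ASCII range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Same range test as the quoted is_valid_name_char(): printable ASCII
 * (0x20..0x7e) minus '[' (0x5b), '\' (0x5c) and ']' (0x5d).
 */
static bool name_char_ok(char ch)
{
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/* Hypothetical helper: true if every byte of a NUL-terminated name passes. */
static bool name_ok(const char *name)
{
	assert(name != NULL);

	for (const char *p = name; *p != '\0'; p++) {
		if (!name_char_ok(*p))
			return false;
	}
	return true;
}
```

With this check, a name such as "`$(x)`" is accepted, so excluding
backtick alone would not keep shell metacharacters out of the names.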

> +
> +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> +			 unsigned long size, unsigned long arg)
> +{
> +	struct mm_struct *mm = current->mm;
> +	const char __user *uname;
> +	char *name, *pch;
> +	int error;
> +
> +	switch (opt) {
> +	case PR_SET_VMA_ANON_NAME:
> +		uname = (const char __user *)arg;
> +		if (!uname) {
> +			/* Reset the name */
> +			name = NULL;
> +			goto set_name;
> +		}
> +
> +		name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
> +
> +		if (IS_ERR(name))
> +			return PTR_ERR(name);
> +
> +		for (pch = name; *pch != '\0'; pch++) {
> +			if (!is_valid_name_char(*pch)) {
> +				kfree(name);
> +				return -EINVAL;
> +			}
> +		}
> +set_name:
> +		mmap_write_lock(mm);
> +		error = madvise_set_anon_name(mm, addr, size, name);
> +		mmap_write_unlock(mm);
> +		kfree(name);
> +		break;

This is a weird construct with a needless goto. Why not:

	switch (opt) {
	case PR_SET_VMA_ANON_NAME:
		uname = (const char __user *)arg;
		if (uname) {
			name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
			if (IS_ERR(name))
				return PTR_ERR(name);

			for (pch = name; *pch != '\0'; pch++) {
				if (!is_valid_name_char(*pch)) {
					kfree(name);
					return -EINVAL;
				}
			}
		} else {
			/* Reset the name */
			name = NULL;
		}
		mmap_write_lock(mm);
		error = madvise_set_anon_name(mm, addr, size, name);
		mmap_write_unlock(mm);
		kfree(name);
		break;
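
The goto-free shape above can also be modelled in userspace to check
that the NULL (reset) case and the error paths behave the same. This is
a rough sketch under stated assumptions, not the kernel code:
copy_and_check_name() is a hypothetical name, strnlen()/strdup() stand
in for strndup_user() (which, as far as I understand it, returns -EINVAL
when the string plus its NUL does not fit in the given length), and the
locking and madvise_set_anon_name() call are left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for strnlen() and strdup() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ANON_VMA_NAME_MAX_LEN	256	/* from the patch; includes the NUL */

/* Same expression as the quoted is_valid_name_char(). */
static bool name_char_ok(char ch)
{
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/*
 * Hypothetical userspace model of the name handling in prctl_set_vma().
 * Returns 0 and stores NULL (reset) or a heap copy in *out, which the
 * caller must free(); returns a negative errno on failure.
 */
static int copy_and_check_name(const char *uname, char **out)
{
	char *name = NULL;

	assert(out != NULL);
	*out = NULL;

	if (uname) {
		/* Stand-in for strndup_user(): reject names that do not fit. */
		if (strnlen(uname, ANON_VMA_NAME_MAX_LEN) >= ANON_VMA_NAME_MAX_LEN)
			return -EINVAL;

		name = strdup(uname);
		if (!name)
			return -ENOMEM;

		for (const char *pch = name; *pch != '\0'; pch++) {
			if (!name_char_ok(*pch)) {
				free(name);
				return -EINVAL;
			}
		}
	}
	/* A NULL uname falls through with name == NULL, meaning "reset". */

	*out = name;
	return 0;
}
```

Both branches reach the same tail, so no label is needed before the
locked section.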


-- 
Kees Cook


* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-03 21:35   ` Kees Cook
@ 2021-09-03 21:51     ` Suren Baghdasaryan
  2021-09-05 13:04     ` Pavel Machek
  1 sibling, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-03 21:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Fri, Sep 3, 2021 at 2:35 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > From: Colin Cross <ccross@google.com>
> >
> > In many userspace applications, and especially in VM based applications
> > like Android uses heavily, there are multiple different allocators in use.
> >  At a minimum there is libc malloc and the stack, and in many cases there
> > are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> > multiple VM heaps (one for small objects, one for big objects, etc.).
> > Each of these layers usually has its own tools to inspect its usage;
> > malloc by compiling a debug version, the VM through heap inspection tools,
> > and for direct syscalls there is usually no way to track them.
> >
> > On Android we heavily use a set of tools that use an extended version of
> > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > in userspace and slice their usage by process, shared (COW) vs.  unique
> > mappings, backing, etc.  This can account for real physical memory usage
> > even in cases like fork without exec (which Android uses heavily to share
> > as many private COW pages as possible between processes), Kernel SamePage
> > Merging, and clean zero pages.  It produces a measurement of the pages
> > that only exist in that process (USS, for unique), and a measurement of
> > the physical memory usage of that process with the cost of shared pages
> > being evenly split between processes that share them (PSS).
> >
> > If all anonymous memory is indistinguishable then figuring out the real
> > physical memory usage (PSS) of each heap requires either a pagemap walking
> > tool that can understand the heap debugging of every layer, or for every
> > layer's heap debugging tools to implement the pagemap walking logic, in
> > which case it is hard to get a consistent view of memory across the whole
> > system.
> >
> > Tracking the information in userspace leads to all sorts of problems.
> > It either needs to be stored inside the process, which means every
> > process has to have an API to export its current heap information upon
> > request, or it has to be stored externally in a filesystem that
> > somebody needs to clean up on crashes.  It needs to be readable while
> > the process is still running, so it has to have some sort of
> > synchronization with every layer of userspace.  Efficiently tracking
> > the ranges requires reimplementing something like the kernel vma
> > trees, and linking to it from every layer of userspace.  It requires
> > more memory, more syscalls, more runtime cost, and more complexity to
> > separately track regions that the kernel is already tracking.
> >
> > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > userspace-provided name for anonymous vmas.  The names of named anonymous
> > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> >
> > Userspace can set the name for a region of memory by calling
> > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > Setting the name to NULL clears it. The name length limit is 256 bytes
> > including NUL-terminator and is checked to contain only printable ascii
> > characters (including space), except '[','\' and ']'.
>
> Is the reason this isn't done via madvise() because we're forced into an
> "int" argument there? (Otherwise we could pass a pointer there.)
>
>         int madvise(void *addr, size_t length, int advice);
>

I'll need to check why this was originally done with prctl(), but your
guess might be correct. Also, I'm not sure madvise() would be an
appropriate mechanism for setting an attribute... That is not really
"advice" but a "request".
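
For context, this is roughly how the prctl() form looks from userspace.
It is a minimal sketch, assuming a Linux system: set_anon_vma_name() and
map_named() are hypothetical wrappers, and the two constants are copied
from the include/uapi/linux/prctl.h hunk in this patch in case the
installed headers do not define them. On a kernel without this patch the
call is expected to fail with EINVAL, since prctl() does not know the
option.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Values from the quoted uapi hunk, only used if the headers lack them. */
#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME	0
#endif

/*
 * Hypothetical wrapper: name (or, with name == NULL, un-name) the
 * anonymous region [start, start + len). Returns 0 or a negative errno.
 */
static int set_anon_vma_name(void *start, size_t len, const char *name)
{
	assert(start != NULL);

	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		  (unsigned long)start, (unsigned long)len,
		  (unsigned long)name) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: map private anonymous memory and try to name it.
 * Naming is best effort, so the mapping is returned even when the
 * running kernel rejects the request.
 */
static void *map_named(size_t len, const char *name)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		(void)set_anon_vma_name(p, len, name);
	return p;
}
```

When the call succeeds, the region should show up as [anon:<name>] in
/proc/pid/maps, as described in the commit message. The pointer argument
is the part that does not fit madvise()'s "int advice" parameter.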

> >
> > The name is stored in a pointer in the shared union in vm_area_struct
> > that points to a null terminated string. Anonymous vmas with the same
> > name (equivalent strings) and are otherwise mergeable will be merged.
> > The name pointers are not shared between vmas even if they contain the
> > same name. The name pointer is stored in a union with fields that are
> > only used on file-backed mappings, so it does not increase memory usage.
> >
> > The patch is based on the original patch developed by Colin Cross, more
> > specifically on its latest version [1] posted upstream by Sumit Semwal.
> > It used a userspace pointer to store vma names. In that design, name
> > pointers could be shared between vmas. However during the last upstreaming
> > attempt, Kees Cook raised concerns [2] about this approach and suggested
> > to copy the name into kernel memory space, perform validity checks [3]
> > and store as a string referenced from vm_area_struct.
> > One big concern is about fork() performance which would need to strdup
> > anonymous vma names. Dave Hansen suggested experimenting with worst-case
> > scenario of forking a process with 64k vmas having longest possible names
> > [4]. I ran this experiment on an ARM64 Android device and recorded a
> > worst-case regression of almost 40% when forking such a process. This
> > regression is addressed in the followup patch which replaces the pointer
> > to a name with a refcounted structure that allows sharing the name pointer
> > between vmas of the same name. Instead of duplicating the string during
> > fork() or when splitting a vma it increments the refcount.
> >
> > [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
> > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
> >
> > Changes for prctl(2) manual page (in the options section):
> >
> > PR_SET_VMA
> >       Sets an attribute specified in arg2 for virtual memory areas
> >       starting from the address specified in arg3 and spanning the
> >       size specified  in arg4. arg5 specifies the value of the attribute
> >       to be set. Note that assigning an attribute to a virtual memory
> >       area might prevent it from being merged with adjacent virtual
> >       memory areas due to the difference in that attribute's value.
> >
> >       Currently, arg2 must be one of:
> >
> >       PR_SET_VMA_ANON_NAME
> >               Set a name for anonymous virtual memory areas. arg5 should
> >               be a pointer to a null-terminated string containing the
> >               name. The name length including null byte cannot exceed
> >               256 bytes. If arg5 is NULL, the name of the appropriate
> >               anonymous virtual memory areas will be reset.
> >
> > Signed-off-by: Colin Cross <ccross@google.com>
> > [surenb: rebased over v5.14-rc7, replaced userpointer with a kernel copy
> > and added input sanitization. The bulk of the work here was done by Colin
> > Cross, therefore, with his permission, keeping him as the author]
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > previous version including cover letter with test results is at:
> > https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/
> >
> > changes in v9
> > - Added documentation for prctl(2) manual page describing newly introduced
> > options, per Pavel Machek
> > - Documented the downside of naming an anonymous vma which might prevent
> > it from being merged with adjacent vmas, per Cyrill Gorcunov
> > - Replaced seq_puts+seq_write with seq_printf, per Kees Cook
> > - Changed name validation to allow only printable ascii characters, except for
> > '[', '\' and ']', per Rasmus Villemoes
> > - Added madvise_set_anon_name definition dependency on CONFIG_PROC_FS,
> > per Michal Hocko
> > - Added NULL check for the name input in prctl_set_vma to correctly handle this
> > case, per Michal Hocko
> > - Handle the possibility of kstrdup returning NULL, per Rolf Eike Beer
> > - Changed max anon vma name length from 64 to 256 (as in the original patch)
> > because I found one case of the name length being 139 bytes. If anyone is
> > curious, here it is:
> > dalvik-/data/dalvik-cache/arm64/apex@com.android.permission@priv-app@GooglePermissionController@GooglePermissionController.apk@classes.art
> >
> >  Documentation/filesystems/proc.rst |   2 +
> >  fs/proc/task_mmu.c                 |  12 ++-
> >  fs/userfaultfd.c                   |   7 +-
> >  include/linux/mm.h                 |  13 ++-
> >  include/linux/mm_types.h           |  48 ++++++++++-
> >  include/uapi/linux/prctl.h         |   3 +
> >  kernel/fork.c                      |   2 +
> >  kernel/sys.c                       |  61 ++++++++++++++
> >  mm/madvise.c                       | 131 ++++++++++++++++++++++++++---
> >  mm/mempolicy.c                     |   3 +-
> >  mm/mlock.c                         |   2 +-
> >  mm/mmap.c                          |  38 +++++----
> >  mm/mprotect.c                      |   2 +-
> >  13 files changed, 283 insertions(+), 41 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 042c418f4090..a067eec54ef1 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -431,6 +431,8 @@ is not associated with a file:
> >   [stack]                    the stack of the main process
> >   [vdso]                     the "virtual dynamic shared object",
> >                              the kernel system call handler
> > +[anon:<name>]               an anonymous mapping that has been
> > +                            named by userspace
> >   =======                    ====================================
> >
> >   or if empty, the mapping is anonymous.
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index eb97468dfe4c..d41edb4b4540 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -308,6 +308,8 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
> >
> >       name = arch_vma_name(vma);
> >       if (!name) {
> > +             const char *anon_name;
> > +
> >               if (!mm) {
> >                       name = "[vdso]";
> >                       goto done;
> > @@ -319,8 +321,16 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
> >                       goto done;
> >               }
> >
> > -             if (is_stack(vma))
> > +             if (is_stack(vma)) {
> >                       name = "[stack]";
> > +                     goto done;
> > +             }
> > +
> > +             anon_name = vma_anon_name(vma);
> > +             if (anon_name) {
> > +                     seq_pad(m, ' ');
> > +                     seq_printf(m, "[anon:%s]", anon_name);
> > +             }
> >       }
> >
> >  done:
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 5c2d806e6ae5..5057843fb71a 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -876,7 +876,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> >                                new_flags, vma->anon_vma,
> >                                vma->vm_file, vma->vm_pgoff,
> >                                vma_policy(vma),
> > -                              NULL_VM_UFFD_CTX);
> > +                              NULL_VM_UFFD_CTX, vma_anon_name(vma));
> >               if (prev)
> >                       vma = prev;
> >               else
> > @@ -1440,7 +1440,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> >               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> >                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> >                                vma_policy(vma),
> > -                              ((struct vm_userfaultfd_ctx){ ctx }));
> > +                              ((struct vm_userfaultfd_ctx){ ctx }),
> > +                              vma_anon_name(vma));
> >               if (prev) {
> >                       vma = prev;
> >                       goto next;
> > @@ -1617,7 +1618,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> >               prev = vma_merge(mm, prev, start, vma_end, new_flags,
> >                                vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> >                                vma_policy(vma),
> > -                              NULL_VM_UFFD_CTX);
> > +                              NULL_VM_UFFD_CTX, vma_anon_name(vma));
> >               if (prev) {
> >                       vma = prev;
> >                       goto next;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e59646a5d44d..c72226215f33 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2550,7 +2550,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >  extern struct vm_area_struct *vma_merge(struct mm_struct *,
> >       struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> >       unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> > -     struct mempolicy *, struct vm_userfaultfd_ctx);
> > +     struct mempolicy *, struct vm_userfaultfd_ctx, const char *);
> >  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> >  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> >       unsigned long addr, int new_below);
> > @@ -3285,5 +3285,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
> >       return 0;
> >  }
> >
> > +#if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_PROC_FS)
> > +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > +                       unsigned long len_in, const char *name);
> > +#else
> > +static inline int
> > +madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > +                   unsigned long len_in, const char *name) {
> > +     return 0;
> > +}
> > +#endif
> > +
> >  #endif /* __KERNEL__ */
> >  #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 7f8ee09c711f..968a1d0463d8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -350,11 +350,19 @@ struct vm_area_struct {
> >       /*
> >        * For areas with an address space and backing store,
> >        * linkage into the address_space->i_mmap interval tree.
> > +      *
> > +      * For private anonymous mappings, a pointer to a null terminated string
> > +      * containing the name given to the vma, or NULL if unnamed.
> >        */
> > -     struct {
> > -             struct rb_node rb;
> > -             unsigned long rb_subtree_last;
> > -     } shared;
> > +
> > +     union {
> > +             struct {
> > +                     struct rb_node rb;
> > +                     unsigned long rb_subtree_last;
> > +             } shared;
> > +             /* Serialized by mmap_sem. */
> > +             char *anon_name;
> > +     };
> >
> >       /*
> >        * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
> > @@ -809,4 +817,36 @@ typedef struct {
> >       unsigned long val;
> >  } swp_entry_t;
> >
> > +/*
> > + * mmap_lock should be read-locked when calling vma_anon_name() and while using
> > + * the returned pointer.
> > + */
> > +extern const char *vma_anon_name(struct vm_area_struct *vma);
> > +
> > +/*
> > + * mmap_lock should be read-locked for orig_vma->vm_mm.
> > + * mmap_lock should be write-locked for new_vma->vm_mm or new_vma should be
> > + * isolated.
> > + */
> > +extern void dup_vma_anon_name(struct vm_area_struct *orig_vma,
> > +                           struct vm_area_struct *new_vma);
> > +
> > +/*
> > + * mmap_lock should be write-locked or vma should have been isolated under
> > + * write-locked mmap_lock protection.
> > + */
> > +extern void free_vma_anon_name(struct vm_area_struct *vma);
> > +
> > +/* mmap_lock should be read-locked */
> > +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> > +                                      const char *name)
> > +{
> > +     const char *vma_name = vma_anon_name(vma);
> > +
> > +     if (likely(!vma_name))
> > +             return name == NULL;
> > +
> > +     return name && !strcmp(name, vma_name);
> > +}
> > +
> >  #endif /* _LINUX_MM_TYPES_H */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 43bd7f713c39..4c8cbf510b2d 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -269,4 +269,7 @@ struct prctl_mm_map {
> >  # define PR_SCHED_CORE_SHARE_FROM    3 /* pull core_sched cookie to pid */
> >  # define PR_SCHED_CORE_MAX           4
> >
> > +#define PR_SET_VMA           0x53564d41
> > +# define PR_SET_VMA_ANON_NAME                0
> > +
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 695d1343a254..cfb8c47564d8 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -366,12 +366,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >               *new = data_race(*orig);
> >               INIT_LIST_HEAD(&new->anon_vma_chain);
> >               new->vm_next = new->vm_prev = NULL;
> > +             dup_vma_anon_name(orig, new);
> >       }
> >       return new;
> >  }
> >
> >  void vm_area_free(struct vm_area_struct *vma)
> >  {
> > +     free_vma_anon_name(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 72c7639e3c98..25118902a376 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> >
> >  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> >
> > +#ifdef CONFIG_MMU
> > +
> > +#define ANON_VMA_NAME_MAX_LEN        256
> > +
> > +static inline bool is_valid_name_char(char ch)
> > +{
> > +     /* printable ascii characters, except [ \ ] */
> > +     return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> > +}
> > +
> > +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> > +                      unsigned long size, unsigned long arg)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     const char __user *uname;
> > +     char *name, *pch;
> > +     int error;
> > +
> > +     switch (opt) {
> > +     case PR_SET_VMA_ANON_NAME:
> > +             uname = (const char __user *)arg;
> > +             if (!uname) {
> > +                     /* Reset the name */
> > +                     name = NULL;
> > +                     goto set_name;
> > +             }
> > +
> > +             name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
> > +
> > +             if (IS_ERR(name))
> > +                     return PTR_ERR(name);
> > +
> > +             for (pch = name; *pch != '\0'; pch++) {
> > +                     if (!is_valid_name_char(*pch)) {
> > +                             kfree(name);
> > +                             return -EINVAL;
> > +                     }
> > +             }
> > +set_name:
> > +             mmap_write_lock(mm);
> > +             error = madvise_set_anon_name(mm, addr, size, name);
> > +             mmap_write_unlock(mm);
> > +             kfree(name);
> > +             break;
> > +     default:
> > +             error = -EINVAL;
> > +     }
> > +
> > +     return error;
> > +}
> > +#else /* CONFIG_MMU */
> > +static int prctl_set_vma(unsigned long opt, unsigned long start,
> > +                      unsigned long size, unsigned long arg)
> > +{
> > +     return -EINVAL;
> > +}
> > +#endif
> > +
> >  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >               unsigned long, arg4, unsigned long, arg5)
> >  {
> > @@ -2568,6 +2626,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >               error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> >               break;
> >  #endif
> > +     case PR_SET_VMA:
> > +             error = prctl_set_vma(arg2, arg3, arg4, arg5);
> > +             break;
> >       default:
> >               error = -EINVAL;
> >               break;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 54bf9f73f95d..0c6d0f64d432 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/fadvise.h>
> >  #include <linux/sched.h>
> >  #include <linux/sched/mm.h>
> > +#include <linux/string.h>
> >  #include <linux/uio.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> > @@ -62,19 +63,78 @@ static int madvise_need_mmap_write(int behavior)
> >       }
> >  }
> >
> > +static inline bool has_vma_anon_name(struct vm_area_struct *vma)
> > +{
> > +     return !vma->vm_file && vma->anon_name;
> > +}
> > +
> > +const char *vma_anon_name(struct vm_area_struct *vma)
> > +{
> > +     if (!has_vma_anon_name(vma))
> > +             return NULL;
> > +
> > +     mmap_assert_locked(vma->vm_mm);
> > +
> > +     return vma->anon_name;
> > +}
> > +
> > +void dup_vma_anon_name(struct vm_area_struct *orig_vma,
> > +                    struct vm_area_struct *new_vma)
> > +{
> > +     if (!has_vma_anon_name(orig_vma))
> > +             return;
> > +
> > +     new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
> > +}
> > +
> > +void free_vma_anon_name(struct vm_area_struct *vma)
> > +{
> > +     if (!has_vma_anon_name(vma))
> > +             return;
> > +
> > +     kfree(vma->anon_name);
> > +     vma->anon_name = NULL;
> > +}
> > +
> > +/* mmap_lock should be write-locked */
> > +static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> > +{
> > +     if (!name) {
> > +             free_vma_anon_name(vma);
> > +             return 0;
> > +     }
> > +
> > +     if (vma->anon_name) {
> > +             /* Should never happen, to dup use dup_vma_anon_name() */
> > +             WARN_ON(vma->anon_name == name);
> > +
> > +             /* Same name, nothing to do here */
> > +             if (!strcmp(name, vma->anon_name))
> > +                     return 0;
> > +
> > +             free_vma_anon_name(vma);
> > +     }
> > +     vma->anon_name = kstrdup(name, GFP_KERNEL);
> > +     if (!vma->anon_name)
> > +             return -ENOMEM;
> > +
> > +     return 0;
> > +}
> > +
> >  /*
> > - * Update the vm_flags on regiion of a vma, splitting it or merging it as
> > + * Update the vm_flags on region of a vma, splitting it or merging it as
> >   * necessary.  Must be called with mmap_sem held for writing;
> >   */
> >  static int madvise_update_vma(struct vm_area_struct *vma,
> >                             struct vm_area_struct **prev, unsigned long start,
> > -                           unsigned long end, unsigned long new_flags)
> > +                           unsigned long end, unsigned long new_flags,
> > +                           const char *name)
> >  {
> >       struct mm_struct *mm = vma->vm_mm;
> >       int error;
> >       pgoff_t pgoff;
> >
> > -     if (new_flags == vma->vm_flags) {
> > +     if (new_flags == vma->vm_flags && is_same_vma_anon_name(vma, name)) {
> >               *prev = vma;
> >               return 0;
> >       }
> > @@ -82,7 +142,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> >       pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >       *prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >                         vma->vm_file, pgoff, vma_policy(vma),
> > -                       vma->vm_userfaultfd_ctx);
> > +                       vma->vm_userfaultfd_ctx, name);
> >       if (*prev) {
> >               vma = *prev;
> >               goto success;
> > @@ -91,20 +151,16 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> >       *prev = vma;
> >
> >       if (start != vma->vm_start) {
> > -             if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> > -                     error = -ENOMEM;
> > -                     return error;
> > -             }
> > +             if (unlikely(mm->map_count >= sysctl_max_map_count))
> > +                     return -ENOMEM;
> >               error = __split_vma(mm, vma, start, 1);
> >               if (error)
> >                       return error;
> >       }
> >
> >       if (end != vma->vm_end) {
> > -             if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> > -                     error = -ENOMEM;
> > -                     return error;
> > -             }
> > +             if (unlikely(mm->map_count >= sysctl_max_map_count))
> > +                     return -ENOMEM;
> >               error = __split_vma(mm, vma, end, 0);
> >               if (error)
> >                       return error;
> > @@ -115,10 +171,33 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> >        * vm_flags is protected by the mmap_lock held in write mode.
> >        */
> >       vma->vm_flags = new_flags;
> > +     if (!vma->vm_file) {
> > +             error = replace_vma_anon_name(vma, name);
> > +             if (error)
> > +                     return error;
> > +     }
> >
> >       return 0;
> >  }
> >
> > +static int madvise_vma_anon_name(struct vm_area_struct *vma,
> > +                              struct vm_area_struct **prev,
> > +                              unsigned long start, unsigned long end,
> > +                              unsigned long name)
> > +{
> > +     int error;
> > +
> > +     /* Only anonymous mappings can be named */
> > +     if (vma->vm_file)
> > +             return -EINVAL;
>
> To distinguish from the other EINVALs, should this maybe be EBADF? (As
> in "no, you can't do that, there is already an fd associated".)

Maybe. Although it's a bit strange to talk about a "bad file
descriptor" when the area is supposed to be anonymous. IMHO EINVAL is
the right answer here, but I might be convinced otherwise.

>
> > +
> > +     error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
> > +                                (const char *)name);
> > +     if (error == -ENOMEM)
> > +             error = -EAGAIN;
>
> I think a comment would be useful here too. AIUI, this is done to match
> the error behavior seen under madvise_vma_behavior().

Ack.

>
> > +     return error;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> >       unsigned long end, struct mm_walk *walk)
> > @@ -942,7 +1021,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >               break;
> >       }
> >
> > -     error = madvise_update_vma(vma, prev, start, end, new_flags);
> > +     error = madvise_update_vma(vma, prev, start, end, new_flags,
> > +                                vma_anon_name(vma));
> >
> >  out:
> >       /*
> > @@ -1121,6 +1201,31 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> >       return unmapped_error;
> >  }
> >
> > +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > +                       unsigned long len_in, const char *name)
> > +{
> > +     unsigned long end;
> > +     unsigned long len;
> > +
> > +     if (start & ~PAGE_MASK)
> > +             return -EINVAL;
> > +     len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> > +
> > +     /* Check to see whether len was rounded up from small -ve to zero */
> > +     if (len_in && !len)
> > +             return -EINVAL;
> > +
> > +     end = start + len;
> > +     if (end < start)
> > +             return -EINVAL;
> > +
> > +     if (end == start)
> > +             return 0;
> > +
> > +     return madvise_walk_vmas(mm, start, end, (unsigned long)name,
> > +                              madvise_vma_anon_name);
> > +}
> > +
> >  /*
> >   * The madvise(2) system call.
> >   *
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index e32360e90274..cc21ca7e9d40 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -811,7 +811,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
> >                       ((vmstart - vma->vm_start) >> PAGE_SHIFT);
> >               prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
> >                                vma->anon_vma, vma->vm_file, pgoff,
> > -                              new_pol, vma->vm_userfaultfd_ctx);
> > +                              new_pol, vma->vm_userfaultfd_ctx,
> > +                              vma_anon_name(vma));
> >               if (prev) {
> >                       vma = prev;
> >                       next = vma->vm_next;
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 16d2ee160d43..c878515680af 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -511,7 +511,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >       pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >       *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> >                         vma->vm_file, pgoff, vma_policy(vma),
> > -                       vma->vm_userfaultfd_ctx);
> > +                       vma->vm_userfaultfd_ctx, vma_anon_name(vma));
> >       if (*prev) {
> >               vma = *prev;
> >               goto success;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 181a113b545d..c13934d41f65 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1032,7 +1032,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >   */
> >  static inline int is_mergeable_vma(struct vm_area_struct *vma,
> >                               struct file *file, unsigned long vm_flags,
> > -                             struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> > +                             struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> > +                             const char *anon_name)
> >  {
> >       /*
> >        * VM_SOFTDIRTY should not prevent from VMA merging, if we
> > @@ -1050,6 +1051,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
> >               return 0;
> >       if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
> >               return 0;
> > +     if (!is_same_vma_anon_name(vma, anon_name))
> > +             return 0;
> >       return 1;
> >  }
> >
> > @@ -1082,9 +1085,10 @@ static int
> >  can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
> >                    struct anon_vma *anon_vma, struct file *file,
> >                    pgoff_t vm_pgoff,
> > -                  struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> > +                  struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> > +                  const char *anon_name)
> >  {
> > -     if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
> > +     if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
> >           is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
> >               if (vma->vm_pgoff == vm_pgoff)
> >                       return 1;
> > @@ -1103,9 +1107,10 @@ static int
> >  can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> >                   struct anon_vma *anon_vma, struct file *file,
> >                   pgoff_t vm_pgoff,
> > -                 struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> > +                 struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> > +                  const char *anon_name)
>
> Nit: one too many spaces before "const".

Ack.

>
> >  {
> > -     if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
> > +     if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
> >           is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
> >               pgoff_t vm_pglen;
> >               vm_pglen = vma_pages(vma);
> > @@ -1116,9 +1121,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> >  }
> >
> >  /*
> > - * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
> > - * whether that can be merged with its predecessor or its successor.
> > - * Or both (it neatly fills a hole).
> > + * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
> > + * figure out whether that can be merged with its predecessor or its
> > + * successor.  Or both (it neatly fills a hole).
> >   *
> >   * In most cases - when called for mmap, brk or mremap - [addr,end) is
> >   * certain not to be mapped by the time vma_merge is called; but when
> > @@ -1163,7 +1168,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >                       unsigned long end, unsigned long vm_flags,
> >                       struct anon_vma *anon_vma, struct file *file,
> >                       pgoff_t pgoff, struct mempolicy *policy,
> > -                     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> > +                     struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> > +                     const char *anon_name)
> >  {
> >       pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
> >       struct vm_area_struct *area, *next;
> > @@ -1193,7 +1199,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >                       mpol_equal(vma_policy(prev), policy) &&
> >                       can_vma_merge_after(prev, vm_flags,
> >                                           anon_vma, file, pgoff,
> > -                                         vm_userfaultfd_ctx)) {
> > +                                         vm_userfaultfd_ctx, anon_name)) {
> >               /*
> >                * OK, it can.  Can we now merge in the successor as well?
> >                */
> > @@ -1202,7 +1208,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >                               can_vma_merge_before(next, vm_flags,
> >                                                    anon_vma, file,
> >                                                    pgoff+pglen,
> > -                                                  vm_userfaultfd_ctx) &&
> > +                                                  vm_userfaultfd_ctx, anon_name) &&
> >                               is_mergeable_anon_vma(prev->anon_vma,
> >                                                     next->anon_vma, NULL)) {
> >                                                       /* cases 1, 6 */
> > @@ -1225,7 +1231,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> >                       mpol_equal(policy, vma_policy(next)) &&
> >                       can_vma_merge_before(next, vm_flags,
> >                                            anon_vma, file, pgoff+pglen,
> > -                                          vm_userfaultfd_ctx)) {
> > +                                          vm_userfaultfd_ctx, anon_name)) {
> >               if (prev && addr < prev->vm_end)        /* case 4 */
> >                       err = __vma_adjust(prev, prev->vm_start,
> >                                        addr, prev->vm_pgoff, NULL, next);
> > @@ -1760,7 +1766,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >        * Can we just expand an old mapping?
> >        */
> >       vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> > -                     NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
> > +                     NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
> >       if (vma)
> >               goto out;
> >
> > @@ -1819,7 +1825,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >                */
> >               if (unlikely(vm_flags != vma->vm_flags && prev)) {
> >                       merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
> > -                             NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
> > +                             NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
> >                       if (merge) {
> >                               /* ->mmap() can change vma->vm_file and fput the original file. So
> >                                * fput the vma->vm_file here or we would add an extra fput for file
> > @@ -3081,7 +3087,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> >
> >       /* Can we just expand an old private anonymous mapping? */
> >       vma = vma_merge(mm, prev, addr, addr + len, flags,
> > -                     NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
> > +                     NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
> >       if (vma)
> >               goto out;
> >
> > @@ -3274,7 +3280,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >               return NULL;    /* should never get here */
> >       new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> >                           vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> > -                         vma->vm_userfaultfd_ctx);
> > +                         vma->vm_userfaultfd_ctx, vma_anon_name(vma));
> >       if (new_vma) {
> >               /*
> >                * Source vma may have been merged into new_vma
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 883e2cc85cad..a48ff8e79f48 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -464,7 +464,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >       pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >       *pprev = vma_merge(mm, *pprev, start, end, newflags,
> >                          vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> > -                        vma->vm_userfaultfd_ctx);
> > +                        vma->vm_userfaultfd_ctx, vma_anon_name(vma));
> >       if (*pprev) {
> >               vma = *pprev;
> >               VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
> > --
> > 2.33.0.153.gba50c8fa24-goog
> >
>
> Cool. With my notes above addressed, please consider this:
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>

Thanks! Will copy it into the next rev with fixes.
Suren.

> --
> Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-03 21:47   ` Kees Cook
@ 2021-09-03 21:56     ` Suren Baghdasaryan
  2021-09-03 22:28       ` Kees Cook
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-03 21:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Fri, Sep 3, 2021 at 2:47 PM Kees Cook <keescook@chromium.org> wrote:
>
> (Sorry, a few more things jumped out at me when I looked again...)
>
> On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > [...]
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 72c7639e3c98..25118902a376 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> >
> >  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> >
> > +#ifdef CONFIG_MMU
> > +
> > +#define ANON_VMA_NAME_MAX_LEN        256
> > +
> > +static inline bool is_valid_name_char(char ch)
> > +{
> > +     /* printable ascii characters, except [ \ ] */
> > +     return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> > +}
>
> In the back of my mind, I feel like disallowing backtick would be nice,
> but then if $, (, and ) are allowed, it doesn't matter, and that seems
> too limiting. :)

It's not used by the only current user (Android), and we can always
allow more chars later. Going the other direction and disallowing
some of them later would, I think, be harder (we would need to make
sure nobody uses them). WDYT about keeping it stricter and relaxing
it if needed?

>
> > +
> > +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> > +                      unsigned long size, unsigned long arg)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     const char __user *uname;
> > +     char *name, *pch;
> > +     int error;
> > +
> > +     switch (opt) {
> > +     case PR_SET_VMA_ANON_NAME:
> > +             uname = (const char __user *)arg;
> > +             if (!uname) {
> > +                     /* Reset the name */
> > +                     name = NULL;
> > +                     goto set_name;
> > +             }
> > +
> > +             name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
> > +
> > +             if (IS_ERR(name))
> > +                     return PTR_ERR(name);
> > +
> > +             for (pch = name; *pch != '\0'; pch++) {
> > +                     if (!is_valid_name_char(*pch)) {
> > +                             kfree(name);
> > +                             return -EINVAL;
> > +                     }
> > +             }
> > +set_name:
> > +             mmap_write_lock(mm);
> > +             error = madvise_set_anon_name(mm, addr, size, name);
> > +             mmap_write_unlock(mm);
> > +             kfree(name);
> > +             break;
>
> This is a weird construct with a needless goto. Why not:
>
>         switch (opt) {
>         case PR_SET_VMA_ANON_NAME:
>                 uname = (const char __user *)arg;
>                 if (uname) {
>                         name = strndup_user(uname, ANON_VMA_NAME_MAX_LEN);
>                         if (IS_ERR(name))
>                                 return PTR_ERR(name);
>
>                         for (pch = name; *pch != '\0'; pch++) {
>                                 if (!is_valid_name_char(*pch)) {
>                                         kfree(name);
>                                         return -EINVAL;
>                                 }
>                         }
>                 } else {
>                         /* Reset the name */
>                         name = NULL;
>                 }
>                 mmap_write_lock(mm);
>                 error = madvise_set_anon_name(mm, addr, size, name);
>                 mmap_write_unlock(mm);
>                 kfree(name);
>                 break;

Yeah, I was going back and forth between the two (fewer indents vs.
clearer flow), and you convinced me :) Will change in the next rev.
Thanks for the review!
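
For illustration only (this is not part of the patch): a minimal userspace
sketch of the if/else flow suggested above. prepare_anon_name() is a
hypothetical helper, the bounded length scan plus malloc() is only a rough
stand-in for strndup_user(), and the sketch validates before it allocates,
which the kernel cannot do because it has to copy from userspace first.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ANON_VMA_NAME_MAX_LEN	256

/* Same range check as is_valid_name_char() in the quoted patch. */
static bool is_valid_name_char(char ch)
{
	/* printable ascii characters, except [ \ ] */
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/*
 * Hypothetical userspace stand-in for the name handling in prctl_set_vma().
 * Returns 0 and stores a heap copy in *out, or NULL in *out for "reset".
 * Returns a negative errno value and leaves *out untouched on failure.
 */
static int prepare_anon_name(const char *uname, char **out)
{
	char *name;

	assert(out != NULL);

	if (uname) {
		size_t len = 0;
		const char *pch;

		/* Bounded scan: reject names that do not fit with their NUL. */
		while (len < ANON_VMA_NAME_MAX_LEN && uname[len] != '\0')
			len++;
		if (len == ANON_VMA_NAME_MAX_LEN)
			return -EINVAL;

		for (pch = uname; *pch != '\0'; pch++) {
			if (!is_valid_name_char(*pch))
				return -EINVAL;
		}

		name = malloc(len + 1);
		if (!name)
			return -ENOMEM;
		memcpy(name, uname, len + 1);
	} else {
		/* Reset the name */
		name = NULL;
	}

	*out = name;
	return 0;
}
```

With the else branch spelled out, both paths reach the single assignment at
the end, so no label is needed.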

>
>
> --
> Kees Cook

* Re: [PATCH v9 3/3] mm: add anonymous vma name refcounting
  2021-09-02 23:18 ` [PATCH v9 3/3] mm: add anonymous vma name refcounting Suren Baghdasaryan
@ 2021-09-03 22:20   ` Kees Cook
  0 siblings, 0 replies; 20+ messages in thread
From: Kees Cook @ 2021-09-03 22:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, ccross, sumit.semwal, mhocko, dave.hansen, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team

On Thu, Sep 02, 2021 at 04:18:13PM -0700, Suren Baghdasaryan wrote:
> While forking a process with a high number (64K) of named anonymous vmas, the
> overhead caused by strdup() is noticeable. Experiments with an ARM64 Android
> device show up to a 40% performance regression when forking a process with
> 64k unpopulated anonymous vmas using the max name lengths vs the same
> process with the same number of anonymous vmas having no name.
> Introduce an anon_vma_name refcounted structure to avoid the overhead of
> copying vma names during fork() and when splitting named anonymous vmas.
> When a vma is duplicated, instead of copying the name we increment the
> refcount of this structure. Multiple vmas can point to the same
> anon_vma_name as long as they increment the refcount. The name member of
> anon_vma_name structure is assigned at structure allocation time and is
> never changed. If the vma name changes, the refcount of the original
> structure is dropped, a new anon_vma_name structure is allocated
> to hold the new name, and the vma pointer is updated to point to the new
> structure.
> With this approach the fork() performance regression is reduced 3-4x,
> and for use cases with a more reasonable number of VMAs (a few
> thousand) the regression is not measurable.
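
For illustration only: a small userspace sketch of the sharing scheme
described above. struct shared_name, struct region and the helpers are
hypothetical names, and a plain counter stands in for struct kref, so this
is not thread-safe and not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for struct anon_vma_name. */
struct shared_name {
	unsigned int refcount;
	char name[];		/* set once at allocation, never modified */
};

/* Hypothetical stand-in for a vma; only the name pointer matters here. */
struct region {
	struct shared_name *anon_name;
};

static struct shared_name *shared_name_alloc(const char *name)
{
	size_t count = strlen(name) + 1;	/* include the NUL terminator */
	struct shared_name *sn = malloc(sizeof(*sn) + count);

	if (sn) {
		sn->refcount = 1;
		memcpy(sn->name, name, count);
	}
	return sn;
}

static void shared_name_put(struct shared_name *sn)
{
	if (!sn)
		return;
	assert(sn->refcount > 0);
	if (--sn->refcount == 0)
		free(sn);
}

/* Duplicating a region shares the name instead of copying the string. */
static void region_dup_name(const struct region *orig, struct region *copy)
{
	copy->anon_name = orig->anon_name;
	if (copy->anon_name)
		copy->anon_name->refcount++;
}

/* Renaming drops the old reference and allocates a fresh structure. */
static int region_set_name(struct region *r, const char *name)
{
	if (r->anon_name && name && !strcmp(r->anon_name->name, name))
		return 0;	/* same name, nothing to do */

	shared_name_put(r->anon_name);
	r->anon_name = NULL;
	if (!name)
		return 0;	/* NULL resets the name */

	r->anon_name = shared_name_alloc(name);
	return r->anon_name ? 0 : -ENOMEM;
}
```

Because the string is never modified after allocation, a duplicate only
costs one counter increment regardless of the name length.
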
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> previous version including cover letter with test results is at:
> https://lore.kernel.org/linux-mm/20210827191858.2037087-1-surenb@google.com/
> 
> changes in v9
> - Replaced kzalloc with kmalloc in anon_vma_name_alloc, per Rolf Eike Beer
> 
>  include/linux/mm_types.h |  9 ++++++++-
>  mm/madvise.c             | 43 +++++++++++++++++++++++++++++++++-------
>  2 files changed, 44 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 968a1d0463d8..7feb43daee6c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -5,6 +5,7 @@
>  #include <linux/mm_types_task.h>
>  
>  #include <linux/auxvec.h>
> +#include <linux/kref.h>
>  #include <linux/list.h>
>  #include <linux/spinlock.h>
>  #include <linux/rbtree.h>
> @@ -310,6 +311,12 @@ struct vm_userfaultfd_ctx {
>  struct vm_userfaultfd_ctx {};
>  #endif /* CONFIG_USERFAULTFD */
>  
> +struct anon_vma_name {
> +	struct kref kref;
> +	/* The name needs to be at the end because it is dynamically sized. */
> +	char name[];
> +};
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -361,7 +368,7 @@ struct vm_area_struct {
>  			unsigned long rb_subtree_last;
>  		} shared;
>  		/* Serialized by mmap_sem. */
> -		char *anon_name;
> +		struct anon_vma_name *anon_name;
>  	};
>  
>  	/*
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0c6d0f64d432..adc53edd3fe7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -63,6 +63,28 @@ static int madvise_need_mmap_write(int behavior)
>  	}
>  }
>  
> +static struct anon_vma_name *anon_vma_name_alloc(const char *name)
> +{
> +	struct anon_vma_name *anon_name;
> +	size_t len = strlen(name);
> +
> +	/* Add 1 for NUL terminator at the end of the anon_name->name */
> +	anon_name = kmalloc(sizeof(*anon_name) + len + 1, GFP_KERNEL);
> +	if (anon_name) {
> +		kref_init(&anon_name->kref);
> +		strcpy(anon_name->name, name);

Please don't use strcpy(), even though we know it's safe here. We're
trying to remove it globally (or at least for non-constant buffers)[1].
We can also use the struct_size() helper, along with memcpy():

	/* Add 1 for NUL terminator at the end of the anon_name->name */
	size_t count = strlen(name) + 1;

	anon_name = kmalloc(struct_size(anon_name, name, count), GFP_KERNEL);
	if (anon_name) {
		kref_init(&anon_name->kref);
		memcpy(anon_name->name, name, count);
	}

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
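
For anyone who wants to try the pattern outside the kernel, here is a
minimal userspace sketch, for illustration only. struct named_buf and
named_buf_alloc() are hypothetical names, malloc() replaces kmalloc(), and
the explicit SIZE_MAX check is a hand-rolled approximation of the overflow
protection that struct_size() provides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct named_buf {
	unsigned int refcount;	/* stand-in for struct kref */
	char name[];		/* flexible array member, must be last */
};

static struct named_buf *named_buf_alloc(const char *name)
{
	struct named_buf *nb;
	size_t count;

	assert(name != NULL);
	count = strlen(name) + 1;	/* add 1 for the NUL terminator */

	/* Refuse sizes that would wrap around when added to the header. */
	if (count > SIZE_MAX - sizeof(*nb))
		return NULL;

	nb = malloc(sizeof(*nb) + count);
	if (nb) {
		nb->refcount = 1;
		/* The length is already known, so copy exactly count bytes. */
		memcpy(nb->name, name, count);
	}
	return nb;
}
```

The copy length and the allocation size come from the same count variable,
which is the point of preferring memcpy() over strcpy() here.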

> +	}
> +
> +	return anon_name;
> +}
> +
> +static void vma_anon_name_free(struct kref *kref)
> +{
> +	struct anon_vma_name *anon_name =
> +			container_of(kref, struct anon_vma_name, kref);
> +	kfree(anon_name);
> +}
> +
>  static inline bool has_vma_anon_name(struct vm_area_struct *vma)
>  {
>  	return !vma->vm_file && vma->anon_name;
> @@ -75,7 +97,7 @@ const char *vma_anon_name(struct vm_area_struct *vma)
>  
>  	mmap_assert_locked(vma->vm_mm);
>  
> -	return vma->anon_name;
> +	return vma->anon_name->name;
>  }
>  
>  void dup_vma_anon_name(struct vm_area_struct *orig_vma,
> @@ -84,37 +106,44 @@ void dup_vma_anon_name(struct vm_area_struct *orig_vma,
>  	if (!has_vma_anon_name(orig_vma))
>  		return;
>  
> -	new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
> +	kref_get(&orig_vma->anon_name->kref);
> +	new_vma->anon_name = orig_vma->anon_name;
>  }
>  
>  void free_vma_anon_name(struct vm_area_struct *vma)
>  {
> +	struct anon_vma_name *anon_name;
> +
>  	if (!has_vma_anon_name(vma))
>  		return;
>  
> -	kfree(vma->anon_name);
> +	anon_name = vma->anon_name;
>  	vma->anon_name = NULL;
> +	kref_put(&anon_name->kref, vma_anon_name_free);
>  }
>  
>  /* mmap_lock should be write-locked */
>  static int replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
>  {
> +	const char *anon_name;
> +
>  	if (!name) {
>  		free_vma_anon_name(vma);
>  		return 0;
>  	}
>  
> -	if (vma->anon_name) {
> +	anon_name = vma_anon_name(vma);
> +	if (anon_name) {
>  		/* Should never happen, to dup use dup_vma_anon_name() */
> -		WARN_ON(vma->anon_name == name);
> +		WARN_ON(anon_name == name);
>  
>  		/* Same name, nothing to do here */
> -		if (!strcmp(name, vma->anon_name))
> +		if (!strcmp(name, anon_name))
>  			return 0;
>  
>  		free_vma_anon_name(vma);
>  	}
> -	vma->anon_name = kstrdup(name, GFP_KERNEL);
> +	vma->anon_name = anon_vma_name_alloc(name);
>  	if (!vma->anon_name)
>  		return -ENOMEM;
>  
> -- 
> 2.33.0.153.gba50c8fa24-goog
> 

With the above tweak, please consider this:

Reviewed-by: Kees Cook <keescook@chromium.org>

Thanks for working on this!

-- 
Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-03 21:56     ` Suren Baghdasaryan
@ 2021-09-03 22:28       ` Kees Cook
  2021-10-01  3:44         ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Kees Cook @ 2021-09-03 22:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Fri, Sep 03, 2021 at 02:56:21PM -0700, Suren Baghdasaryan wrote:
> On Fri, Sep 3, 2021 at 2:47 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > (Sorry, a few more things jumped out at me when I looked again...)
> >
> > On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > > [...]
> > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > index 72c7639e3c98..25118902a376 100644
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> > >
> > >  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> > >
> > > +#ifdef CONFIG_MMU
> > > +
> > > +#define ANON_VMA_NAME_MAX_LEN        256
> > > +
> > > +static inline bool is_valid_name_char(char ch)
> > > +{
> > > +     /* printable ascii characters, except [ \ ] */
> > > +     return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> > > +}
> >
> > In the back of my mind, I feel like disallowing backtick would be nice,
> > but then if $, (, and ) are allowed, it doesn't matter, and that seems
> > too limiting. :)
> 
> It's not used by the only current user (Android), and we can always
> allow more chars later. Going the other direction and disallowing
> some of them later would, I think, be harder (we would need to make
> sure nobody uses them). WDYT about keeping it stricter and relaxing
> it if needed?

I'd say, if we can also drop each of: ` $ ( )
then let's do it. Better to keep the obvious shell meta-characters out
of this, although I don't feel strongly about it. Anything that might
get confused by this would be similarly confused by binary names too:

$ cat /proc/3407216/maps
560bdafd4000-560bdafd6000 r--p 00000000 fd:02 2621909 /tmp/yay`wat

And it's probably easier to change a binary name than to call prctl. :P

I'm good either way. What you have now is great, but if we wanted to be
extra extra strict, we can add the other 4 above.
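
For illustration only: a userspace sketch of what the stricter check could
look like. is_valid_name_char() repeats the range test from the quoted
patch; is_valid_name_char_strict() is a hypothetical name for the variant
that also rejects the four shell meta-characters listed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Range check from the quoted patch: printable ascii, except [ \ ] */
static bool is_valid_name_char(char ch)
{
	return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
}

/* Hypothetical stricter variant: additionally rejects ` $ ( ) */
static bool is_valid_name_char_strict(char ch)
{
	switch (ch) {
	case '`':
	case '$':
	case '(':
	case ')':
		return false;
	default:
		return is_valid_name_char(ch);
	}
}
```

Everything the strict variant accepts is also accepted by the current
check, so relaxing it later would not invalidate existing names.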

-- 
Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-03 21:35   ` Kees Cook
  2021-09-03 21:51     ` Suren Baghdasaryan
@ 2021-09-05 13:04     ` Pavel Machek
  2021-09-06 15:52       ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2021-09-05 13:04 UTC (permalink / raw)
  To: Kees Cook
  Cc: Suren Baghdasaryan, akpm, ccross, sumit.semwal, mhocko,
	dave.hansen, willy, kirill.shutemov, vbabka, hannes, corbet,
	viro, rdunlap, kaleshsingh, peterx, rppt, peterz,
	catalin.marinas, vincenzo.frascino, chinwen.chang, axelrasmussen,
	aarcange, jannh, apopple, jhubbard, yuzhao, will, fenghua.yu,
	thunder.leizhen, hughd, feng.tang, jgg, guro, tglx, krisman,
	chris.hyser, pcc, ebiederm, axboe, legion, eb, gorcunov,
	songmuchun, viresh.kumar, thomascedeno, sashal, cxfcosmos, linux,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm, kernel-team

Hi!

> > the process is still running, so it has to have some sort of
> > synchronization with every layer of userspace.  Efficiently tracking
> > the ranges requires reimplementing something like the kernel vma
> > trees, and linking to it from every layer of userspace.  It requires
> > more memory, more syscalls, more runtime cost, and more complexity to
> > separately track regions that the kernel is already tracking.

Ok so far.

> > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > userspace-provided name for anonymous vmas.  The names of named anonymous
> > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> > 
> > Userspace can set the name for a region of memory by calling
> > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned
> > long)name);

Would setting a 64-bit integer instead of a name be enough? Even if
each party set it randomly, the risk of collisions would be very
low... and we wouldn't have to deal with strings in the kernel.

								Pavel


* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-05 13:04     ` Pavel Machek
@ 2021-09-06 15:52       ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-06 15:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kees Cook, Andrew Morton, Colin Cross, Sumit Semwal,
	Michal Hocko, Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Sun, Sep 5, 2021 at 6:04 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Hi!
>
> > > the process is still running, so it has to have some sort of
> > > synchronization with every layer of userspace.  Efficiently tracking
> > > the ranges requires reimplementing something like the kernel vma
> > > trees, and linking to it from every layer of userspace.  It requires
> > > more memory, more syscalls, more runtime cost, and more complexity to
> > > separately track regions that the kernel is already tracking.
>
> Ok so far.
>
> > > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > > userspace-provided name for anonymous vmas.  The names of named anonymous
> > > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> > >
> > > Userspace can set the name for a region of memory by calling
> > > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned
> > > long)name);
>
> Would setting a 64-bit integer instead of a name be enough? Even if
> each party set it randomly, the risk of collisions would be very
> low... and we wouldn't have to deal with strings in the kernel.

Thanks for the question, Pavel. I believe this was discussed in this
thread before and Colin provided the explanation with usage examples:
https://lore.kernel.org/linux-mm/20200821070552.GW2074@grain/.
Thanks,
Suren.

>
>                                                                 Pavel
>
>

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
  2021-09-03 21:35   ` Kees Cook
  2021-09-03 21:47   ` Kees Cook
@ 2021-09-06 16:55   ` Matthew Wilcox
  2021-09-09  4:05     ` Suren Baghdasaryan
  2021-10-01  7:01   ` Rasmus Villemoes
  3 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-09-06 16:55 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, ccross, sumit.semwal, mhocko, dave.hansen, keescook,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team

On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
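
The USS and PSS definitions quoted above can be made concrete with a small
worked sketch. It assumes the tool has already obtained, for each resident
page of a process, the number of processes mapping that page; the struct
and function names are hypothetical and not taken from any Android tool.

```c
#include <stddef.h>

/* Hypothetical result type: both values are counted in pages. */
struct mem_usage {
	double uss_pages;	/* pages mapped only by this process */
	double pss_pages;	/* each page weighted by 1 / number of sharers */
};

/*
 * mapcount[i] is the number of processes mapping page i of this process.
 * A count of 0 is treated as "not resident" and skipped.
 */
static struct mem_usage account_pages(const unsigned int *mapcount, size_t n)
{
	struct mem_usage u = { 0.0, 0.0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (mapcount[i] == 0)
			continue;
		if (mapcount[i] == 1)
			u.uss_pages += 1.0;
		/* the cost of a shared page is split evenly between sharers */
		u.pss_pages += 1.0 / mapcount[i];
	}
	return u;
}
```

For map counts {1, 1, 2, 4} this gives a USS of 2 pages and a PSS of
1 + 1 + 1/2 + 1/4 = 2.75 pages.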

I understand that the information is currently incoherent, but why is
this the right way to make it coherent?  It would seem more useful to
use something like one of the tracing mechanisms (eg ftrace, LTTng,
whatever the current hotness is in userspace tracing) for the malloc
library to log all the useful information, instead of injecting a subset
of it into the kernel for userspace to read out again.

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-06 16:55   ` Matthew Wilcox
@ 2021-09-09  4:05     ` Suren Baghdasaryan
  2021-09-30 18:56       ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-09  4:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Kees Cook, Kirill A . Shutemov, Vlastimil Babka,
	Johannes Weiner, Jonathan Corbet, Al Viro, Randy Dunlap,
	Kalesh Singh, Peter Xu, rppt, Peter Zijlstra, Catalin Marinas,
	vincenzo.frascino, Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Mon, Sep 6, 2021 at 9:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > On Android we heavily use a set of tools that use an extended version of
> > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > in userspace and slice their usage by process, shared (COW) vs.  unique
> > mappings, backing, etc.  This can account for real physical memory usage
> > even in cases like fork without exec (which Android uses heavily to share
> > as many private COW pages as possible between processes), Kernel SamePage
> > Merging, and clean zero pages.  It produces a measurement of the pages
> > that only exist in that process (USS, for unique), and a measurement of
> > the physical memory usage of that process with the cost of shared pages
> > being evenly split between processes that share them (PSS).
> >
> > If all anonymous memory is indistinguishable then figuring out the real
> > physical memory usage (PSS) of each heap requires either a pagemap walking
> > tool that can understand the heap debugging of every layer, or for every
> > layer's heap debugging tools to implement the pagemap walking logic, in
> > which case it is hard to get a consistent view of memory across the whole
> > system.
> >
> > Tracking the information in userspace leads to all sorts of problems.
> > It either needs to be stored inside the process, which means every
> > process has to have an API to export its current heap information upon
> > request, or it has to be stored externally in a filesystem that
> > somebody needs to clean up on crashes.  It needs to be readable while
> > the process is still running, so it has to have some sort of
> > synchronization with every layer of userspace.  Efficiently tracking
> > the ranges requires reimplementing something like the kernel vma
> > trees, and linking to it from every layer of userspace.  It requires
> > more memory, more syscalls, more runtime cost, and more complexity to
> > separately track regions that the kernel is already tracking.
>
> I understand that the information is currently incoherent, but why is
> this the right way to make it coherent?  It would seem more useful to
> use something like one of the tracing mechanisms (eg ftrace, LTTng,
> whatever the current hotness is in userspace tracing) for the malloc
> library to log all the useful information, instead of injecting a subset
> of it into the kernel for userspace to read out again.

Sorry for the delay in responding. I'm travelling and my internet
access is very patchy.

Just to clarify, your suggestion is to require userspace to log every
allocation using ftrace or a similar mechanism, and then have the system
parse these logs to calculate the memory usage of each process?
I haven't thought much in this direction, but I'd guess that logging each
allocation in the system and periodically collecting that data would
be quite expensive in terms of both memory usage and performance. I'll
need to think about it a bit more, but to me these are the obvious
downsides of this approach.

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-09  4:05     ` Suren Baghdasaryan
@ 2021-09-30 18:56       ` Suren Baghdasaryan
  2021-09-30 23:25         ` Kees Cook
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-09-30 18:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Kees Cook, Kirill A . Shutemov, Vlastimil Babka,
	Johannes Weiner, Jonathan Corbet, Al Viro, Randy Dunlap,
	Kalesh Singh, Peter Xu, rppt, Peter Zijlstra, Catalin Marinas,
	vincenzo.frascino, Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Wed, Sep 8, 2021 at 9:05 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Sep 6, 2021 at 9:57 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > > On Android we heavily use a set of tools that use an extended version of
> > > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > > in userspace and slice their usage by process, shared (COW) vs.  unique
> > > mappings, backing, etc.  This can account for real physical memory usage
> > > even in cases like fork without exec (which Android uses heavily to share
> > > as many private COW pages as possible between processes), Kernel SamePage
> > > Merging, and clean zero pages.  It produces a measurement of the pages
> > > that only exist in that process (USS, for unique), and a measurement of
> > > the physical memory usage of that process with the cost of shared pages
> > > being evenly split between processes that share them (PSS).
> > >
> > > If all anonymous memory is indistinguishable then figuring out the real
> > > physical memory usage (PSS) of each heap requires either a pagemap walking
> > > tool that can understand the heap debugging of every layer, or for every
> > > layer's heap debugging tools to implement the pagemap walking logic, in
> > > which case it is hard to get a consistent view of memory across the whole
> > > system.
> > >
> > > Tracking the information in userspace leads to all sorts of problems.
> > > It either needs to be stored inside the process, which means every
> > > process has to have an API to export its current heap information upon
> > > request, or it has to be stored externally in a filesystem that
> > > somebody needs to clean up on crashes.  It needs to be readable while
> > > the process is still running, so it has to have some sort of
> > > synchronization with every layer of userspace.  Efficiently tracking
> > > the ranges requires reimplementing something like the kernel vma
> > > trees, and linking to it from every layer of userspace.  It requires
> > > more memory, more syscalls, more runtime cost, and more complexity to
> > > separately track regions that the kernel is already tracking.
> >
> > I understand that the information is currently incoherent, but why is
> > this the right way to make it coherent?  It would seem more useful to
> > use something like one of the tracing mechanisms (eg ftrace, LTTng,
> > whatever the current hotness is in userspace tracing) for the malloc
> > library to log all the useful information, instead of injecting a subset
> > of it into the kernel for userspace to read out again.
>
> Sorry for the delay in responding. I'm travelling and my internet
> access is very patchy.
>
> Just to clarify, your suggestion is to require userspace to log every
> allocation using ftrace or a similar mechanism, and then have the system
> parse these logs to calculate the memory usage of each process?
> I haven't thought much in this direction, but I'd guess that logging each
> allocation in the system and periodically collecting that data would
> be quite expensive in terms of both memory usage and performance. I'll
> need to think about it a bit more, but to me these are the obvious
> downsides of this approach.

Sorry for the delay again. Now that I'm back, there should not be any
more of them.
I thought more about these alternative suggestions for userspace to
record allocations, but they would introduce considerable complexity
into userspace. Some daemon would have to collect and consolidate this
data, all users would have to query it for the data (via IPC or
something similar), and if the daemon crashed, the data would need to
be recovered somehow. So, in short, it's possible, but it makes
things much more complex than the proposed in-kernel implementation.
OTOH, the only downside of the current implementation is the
additional memory required to store anon vma names. I checked the
memory consumption on the latest Android with these patches and,
because we share vma names during fork, the actual memory required to
store vma names is no more than 600kB. Even on older phones like the
Pixel 3 with 4GB RAM, this is less than 0.015% of total memory. IMHO,
this is an acceptable price to pay.
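
As a quick check of the figure above, here is a minimal sketch of the
arithmetic. It assumes binary units (1kB = 1024 bytes, 1GB = 2^30 bytes),
which the message does not state; the function name is hypothetical.

```c
/* Share of total RAM consumed, expressed as a percentage. */
static double percent_of_ram(double used_bytes, double total_bytes)
{
	return 100.0 * used_bytes / total_bytes;
}
```

With used_bytes = 600 * 1024 and total_bytes = 4 * 2^30 this comes to
roughly 0.0143%, which is consistent with the "less than 0.015%" estimate.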

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-30 18:56       ` Suren Baghdasaryan
@ 2021-09-30 23:25         ` Kees Cook
  0 siblings, 0 replies; 20+ messages in thread
From: Kees Cook @ 2021-09-30 23:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Andrew Morton, Colin Cross, Sumit Semwal,
	Michal Hocko, Dave Hansen, Kirill A . Shutemov, Vlastimil Babka,
	Johannes Weiner, Jonathan Corbet, Al Viro, Randy Dunlap,
	Kalesh Singh, Peter Xu, rppt, Peter Zijlstra, Catalin Marinas,
	vincenzo.frascino, Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Thu, Sep 30, 2021 at 11:56:12AM -0700, Suren Baghdasaryan wrote:
> I thought more about these alternative suggestions for userspace to
> record allocations, but they would introduce considerable complexity
> into userspace. Some daemon would have to collect and consolidate this
> data, all users would have to query it for the data (via IPC or
> something similar), and if the daemon crashed, the data would need to
> be recovered somehow. So, in short, it's possible, but it makes
> things much more complex than the proposed in-kernel implementation.

Agreed: this is something for the kernel to manage.

> OTOH, the only downside of the current implementation is the
> additional memory required to store anon vma names. I checked the
> memory consumption on the latest Android with these patches and,
> because we share vma names during fork, the actual memory required to
> store vma names is no more than 600kB. Even on older phones like the
> Pixel 3 with 4GB RAM, this is less than 0.015% of total memory. IMHO,
> this is an acceptable price to pay.

I think that's entirely fine. We don't end up with any GUP games, and
everything is refcounted.

I think a v10 with the various nits fixed would be a good next step
here. What do you think, Matthew?

-- 
Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-03 22:28       ` Kees Cook
@ 2021-10-01  3:44         ` Suren Baghdasaryan
  2021-10-01  5:19           ` Kees Cook
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-10-01  3:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Fri, Sep 3, 2021 at 3:28 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Sep 03, 2021 at 02:56:21PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Sep 3, 2021 at 2:47 PM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > (Sorry, a few more things jumped out at me when I looked again...)
> > >
> > > On Thu, Sep 02, 2021 at 04:18:12PM -0700, Suren Baghdasaryan wrote:
> > > > [...]
> > > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > > index 72c7639e3c98..25118902a376 100644
> > > > --- a/kernel/sys.c
> > > > +++ b/kernel/sys.c
> > > > @@ -2299,6 +2299,64 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> > > >
> > > >  #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> > > >
> > > > +#ifdef CONFIG_MMU
> > > > +
> > > > +#define ANON_VMA_NAME_MAX_LEN        256
> > > > +
> > > > +static inline bool is_valid_name_char(char ch)
> > > > +{
> > > > +     /* printable ascii characters, except [ \ ] */
> > > > +     return (ch > 0x1f && ch < 0x5b) || (ch > 0x5d && ch < 0x7f);
> > > > +}
> > >
> > > In the back of my mind, I feel like disallowing backtick would be nice,
> > > but then if $, (, and ) are allowed, it doesn't matter, and that seems
> > > too limiting. :)
> >
> > It's not used by the only current user (Android) and we can always
> > allow more chars later. However going the other direction and
> > disallowing some of them I think would be harder (need to make sure
> > nobody uses them). WDYT if we keep it stricter and relax if needed?
>
> I'd say, if we can also drop each of: ` $ ( )
> then let's do it. Better to keep the obvious shell meta-characters out
> of this, although I don't feel strongly about it. Anything that might
> get confused by this would be similarly confused by binary names too:
>
> $ cat /proc/3407216/maps
> 560bdafd4000-560bdafd6000 r--p 00000000 fd:02 2621909 /tmp/yay`wat
>
> And it's probably easier to change a binary name than to call prctl. :P
>
> I'm good either way. What you have now is great, but if we wanted to be
> extra extra strict, we can add the other 4 above.

While testing v10, I found one case where ( and ) are used in a name:
"dalvik-main space (region space)". So I can add ` and $ to the
restricted set, but not ( and ). Kees, would you be happy with:

static inline bool is_valid_name_char(char ch)
{
    return ch > 0x1f && ch < 0x7f && !strchr("\\`$[]", ch);
}

?
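
To show how the proposed per-character check would apply to a whole name,
here is a userspace sketch. The is_valid_name_char() body is the one
proposed above; is_valid_anon_name() is a hypothetical wrapper, and it
assumes that ANON_VMA_NAME_MAX_LEN (256 in v9) counts the terminating NUL.
Whether an empty name should be accepted is left open here; this sketch
accepts it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ANON_VMA_NAME_MAX_LEN 256	/* v9 value; assumed to include the NUL */

static inline bool is_valid_name_char(char ch)
{
	/* printable ASCII characters, except \ ` $ [ ] */
	return ch > 0x1f && ch < 0x7f && !strchr("\\`$[]", ch);
}

/* Hypothetical wrapper: check the length limit and every character. */
static bool is_valid_anon_name(const char *name)
{
	size_t i;

	for (i = 0; name[i] != '\0'; i++) {
		/* reject names that leave no room for the terminating NUL */
		if (i + 1 >= ANON_VMA_NAME_MAX_LEN)
			return false;
		if (!is_valid_name_char(name[i]))
			return false;
	}
	return true;
}
```

With this set, "dalvik-main space (region space)" is accepted, while names
containing a backtick, $, a backslash, or square brackets are rejected.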

>
> --
> Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-10-01  3:44         ` Suren Baghdasaryan
@ 2021-10-01  5:19           ` Kees Cook
  0 siblings, 0 replies; 20+ messages in thread
From: Kees Cook @ 2021-10-01  5:19 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, Rasmus Villemoes, LKML, linux-fsdevel,
	linux-doc, linux-mm, kernel-team

On Thu, Sep 30, 2021 at 08:44:25PM -0700, Suren Baghdasaryan wrote:
> While testing v10, I found one case where ( and ) are used in a name:
> "dalvik-main space (region space)". So I can add ` and $ to the
> restricted set, but not ( and ). Kees, would you be happy with:
> 
> static inline bool is_valid_name_char(char ch)
> {
>     return ch > 0x1f && ch < 0x7f && !strchr("\\`$[]", ch);
> }
> 
> ?

That works for me! :)

-- 
Kees Cook

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
                     ` (2 preceding siblings ...)
  2021-09-06 16:55   ` Matthew Wilcox
@ 2021-10-01  7:01   ` Rasmus Villemoes
  2021-10-01 16:34     ` Suren Baghdasaryan
  3 siblings, 1 reply; 20+ messages in thread
From: Rasmus Villemoes @ 2021-10-01  7:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: ccross, sumit.semwal, mhocko, dave.hansen, keescook, willy,
	kirill.shutemov, vbabka, hannes, corbet, viro, rdunlap,
	kaleshsingh, peterx, rppt, peterz, catalin.marinas,
	vincenzo.frascino, chinwen.chang, axelrasmussen, aarcange, jannh,
	apopple, jhubbard, yuzhao, will, fenghua.yu, thunder.leizhen,
	hughd, feng.tang, jgg, guro, tglx, krisman, chris.hyser, pcc,
	ebiederm, axboe, legion, eb, gorcunov, songmuchun, viresh.kumar,
	thomascedeno, sashal, cxfcosmos, linux, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, kernel-team

On 03/09/2021 01.18, Suren Baghdasaryan wrote:
> From: Colin Cross <ccross@google.com>
> 
> 
> changes in v9
> - Changed max anon vma name length from 64 to 256 (as in the original patch)
> because I found one case of the name length being 139 bytes. If anyone is
> curious, here it is:
> dalvik-/data/dalvik-cache/arm64/apex@com.android.permission@priv-app@GooglePermissionController@GooglePermissionController.apk@classes.art

I'm not sure that's a very convincing argument. We don't add code
arbitrarily just because some userspace code running on some custom
kernel (ab)uses something in that kernel. Surely that user can come up
with a name that doesn't contain GooglePermissionController twice.

The argument for using strings and not just a 128 bit uuid was that it
should (also) be human readable, and 250-byte strings are not that.
Also, there's no natural law forcing this to be some power-of-two, and
in fact the implementation means that it's actually somewhat harmful
(give it a 256 char name, and we'll do a 260 byte alloc, which becomes a
512 byte alloc). So just make the limit 80, the kernel's definition of a
sane line length. As for the allowed chars, it can be relaxed later if
convincing arguments can be made.
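
The allocation-size point can be illustrated with a rough model. This is
only a sketch under stated assumptions: the table below is a typical set of
general-purpose kmalloc size classes (the real set depends on architecture
and configuration), and the 4 extra bytes are assumed to be a refcount
header stored in front of the string, which is what the 260-byte figure
suggests. Both function names are hypothetical.

```c
#include <stddef.h>

/* Assumed kmalloc size classes; not authoritative for any given kernel. */
static const size_t kmalloc_classes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096
};

/* Smallest assumed size class that fits the request, or 0 if not modeled. */
static size_t kmalloc_bucket(size_t request)
{
	size_t i;

	for (i = 0; i < sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]); i++) {
		if (request <= kmalloc_classes[i])
			return kmalloc_classes[i];
	}
	return 0;
}

/* Bytes requested for a name buffer plus the assumed 4-byte header. */
static size_t anon_name_alloc_size(size_t max_name_len)
{
	return 4 + max_name_len;
}
```

Under these assumptions a 256-byte name buffer needs 260 bytes and lands in
the 512-byte class, while an 80-byte buffer needs 84 bytes and fits in the
96-byte class.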


> +/* mmap_lock should be read-locked */
> +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> +					 const char *name)
> +{
> +	const char *vma_name = vma_anon_name(vma);
> +
> +	if (likely(!vma_name))
> +		return name == NULL;
> +
> +	return name && !strcmp(name, vma_name);

It's probably preferable to spell this

  /* either both NULL, or pointers to same refcounted string */
  if (vma_name == name)
      return true;

  return name && vma_name && !strcmp(name, vma_name);

so you have one less conditional in the common case.
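
As a standalone sketch of the suggested comparison, written against plain
strings rather than struct vm_area_struct so that it can be compiled in
userspace (the function name same_anon_name() is hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * True if both names are NULL, both point at the same string, or both are
 * non-NULL with equal contents.
 */
static bool same_anon_name(const char *vma_name, const char *name)
{
	/* either both NULL, or pointers to the same refcounted string */
	if (vma_name == name)
		return true;

	return name && vma_name && !strcmp(name, vma_name);
}
```

The pointer comparison handles the common unnamed case (both NULL) and the
shared-string case without calling strcmp().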

Rasmus

* Re: [PATCH v9 2/3] mm: add a field to store names for private anonymous memory
  2021-10-01  7:01   ` Rasmus Villemoes
@ 2021-10-01 16:34     ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2021-10-01 16:34 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Colin Cross, Sumit Semwal, Michal Hocko,
	Dave Hansen, Kees Cook, Matthew Wilcox, Kirill A . Shutemov,
	Vlastimil Babka, Johannes Weiner, Jonathan Corbet, Al Viro,
	Randy Dunlap, Kalesh Singh, Peter Xu, rppt, Peter Zijlstra,
	Catalin Marinas, vincenzo.frascino,
	Chinwen Chang (張錦文),
	Axel Rasmussen, Andrea Arcangeli, Jann Horn, apopple,
	John Hubbard, Yu Zhao, Will Deacon, fenghua.yu, thunder.leizhen,
	Hugh Dickins, feng.tang, Jason Gunthorpe, Roman Gushchin,
	Thomas Gleixner, krisman, chris.hyser, Peter Collingbourne,
	Eric W. Biederman, Jens Axboe, legion, Rolf Eike Beer,
	Cyrill Gorcunov, Muchun Song, Viresh Kumar, Thomas Cedeno,
	sashal, cxfcosmos, LKML, linux-fsdevel, linux-doc, linux-mm,
	kernel-team

On Fri, Oct 1, 2021 at 12:01 AM Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>
> On 03/09/2021 01.18, Suren Baghdasaryan wrote:
> > From: Colin Cross <ccross@google.com>
> >
> >
> > changes in v9
> > - Changed max anon vma name length from 64 to 256 (as in the original patch)
> > because I found one case of the name length being 139 bytes. If anyone is
> > curious, here it is:
> > dalvik-/data/dalvik-cache/arm64/apex@com.android.permission@priv-app@GooglePermissionController@GooglePermissionController.apk@classes.art
>
> I'm not sure that's a very convincing argument. We don't add code
> arbitrarily just because some userspace code running on some custom
> kernel (ab)uses something in that kernel. Surely that user can come up
> with a name that doesn't contain GooglePermissionController twice.
>
> The argument for using strings and not just a 128 bit uuid was that it
> should (also) be human readable, and 250-byte strings are not that.
> Also, there's no natural law forcing this to be some power-of-two, and
> in fact the implementation means that it's actually somewhat harmful
> (give it a 256 char name, and we'll do a 260 byte alloc, which becomes a
> 512 byte alloc). So just make the limit 80, the kernel's definition of a
> sane line length.

Sounds reasonable. I'll set the limit to 80 and look into whether the
userspace side can trim its names to fit within this limit.

> As for the allowed chars, it can be relaxed later if convincing arguments can be made.

For the disallowed chars, I would like to go with the "\\`$[]" set because
of the example I presented in my last reply. Since we disallow $,
parsers should be able to process parentheses with no issues, I think.

>
>
> > +/* mmap_lock should be read-locked */
> > +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> > +                                      const char *name)
> > +{
> > +     const char *vma_name = vma_anon_name(vma);
> > +
> > +     if (likely(!vma_name))
> > +             return name == NULL;
> > +
> > +     return name && !strcmp(name, vma_name);
>
> It's probably preferable to spell this
>
>   /* either both NULL, or pointers to same refcounted string */
>   if (vma_name == name)
>       return true;
>
>   return name && vma_name && !strcmp(name, vma_name);
>
> so you have one less conditional in the common case.

Ack.

>
> Rasmus

Thanks for the review!
Suren.

Thread overview: 20+ messages
2021-09-02 23:18 [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
2021-09-02 23:18 ` [PATCH v9 2/3] mm: add a field to store names for private anonymous memory Suren Baghdasaryan
2021-09-03 21:35   ` Kees Cook
2021-09-03 21:51     ` Suren Baghdasaryan
2021-09-05 13:04     ` Pavel Machek
2021-09-06 15:52       ` Suren Baghdasaryan
2021-09-03 21:47   ` Kees Cook
2021-09-03 21:56     ` Suren Baghdasaryan
2021-09-03 22:28       ` Kees Cook
2021-10-01  3:44         ` Suren Baghdasaryan
2021-10-01  5:19           ` Kees Cook
2021-09-06 16:55   ` Matthew Wilcox
2021-09-09  4:05     ` Suren Baghdasaryan
2021-09-30 18:56       ` Suren Baghdasaryan
2021-09-30 23:25         ` Kees Cook
2021-10-01  7:01   ` Rasmus Villemoes
2021-10-01 16:34     ` Suren Baghdasaryan
2021-09-02 23:18 ` [PATCH v9 3/3] mm: add anonymous vma name refcounting Suren Baghdasaryan
2021-09-03 22:20   ` Kees Cook
2021-09-03  0:28 ` [PATCH v9 1/3] mm: rearrange madvise code to allow for reuse Suren Baghdasaryan
