linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Kees Cook <keescook@chromium.org>
To: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	"Alexey Dobriyan" <adobriyan@gmail.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Mauro Carvalho Chehab" <mchehab+huawei@kernel.org>,
	"Michal Hocko" <mhocko@suse.com>,
	"Colin Cross" <ccross@google.com>,
	"Alexey Gladkov" <gladkov.alexey@gmail.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Michel Lespinasse" <walken@google.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Song Liu" <songliubraving@fb.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Yang Shi" <yang.shi@linux.alibaba.com>,
	chenqiwu <chenqiwu@xiaomi.com>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Mike Christie" <mchristi@redhat.com>,
	"Bart Van Assche" <bvanassche@acm.org>,
	"Amit Pundir" <amit.pundir@linaro.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Christian Brauner" <christian.brauner@ubuntu.com>,
	"Daniel Jordan" <daniel.m.jordan@oracle.com>,
	"Adrian Reber" <areber@redhat.com>,
	"Nicolas Viennot" <Nicolas.Viennot@twosigma.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org,
	"John Stultz" <john.stultz@linaro.org>,
	"Pekka Enberg" <penberg@kernel.org>,
	"Dave Hansen" <dave.hansen@intel.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	"Jan Glauber" <jan.glauber@gmail.com>,
	"Rob Landley" <rob@landley.net>,
	"Cyrill Gorcunov" <gorcunov@openvz.org>,
	"Serge E. Hallyn" <serge.hallyn@ubuntu.com>,
	"David Rientjes" <rientjes@google.com>,
	"Hugh Dickins" <hughd@google.com>,
	"Rik van Riel" <riel@redhat.com>, "Mel Gorman" <mgorman@suse.de>,
	"Tang Chen" <tangchen@cn.fujitsu.com>,
	"Robin Holt" <holt@sgi.com>, "Shaohua Li" <shli@fusionio.com>,
	"Sasha Levin" <sasha.levin@oracle.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Minchan Kim" <minchan@kernel.org>
Subject: Re: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
Date: Thu, 3 Sep 2020 11:00:36 -0700	[thread overview]
Message-ID: <202009031031.D32EF57ED@keescook> (raw)
In-Reply-To: <20200901161459.11772-4-sumit.semwal@linaro.org>

On Tue, Sep 01, 2020 at 09:44:59PM +0530, Sumit Semwal wrote:
> From: Colin Cross <ccross@google.com>
> 
> In many userspace applications, and especially in VM based applications
> like Android uses heavily, there are multiple different allocators in use.
>  At a minimum there is libc malloc and the stack, and in many cases there
> are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> multiple VM heaps (one for small objects, one for big objects, etc.).
> Each of these layers usually has its own tools to inspect its usage;
> malloc by compiling a debug version, the VM through heap inspection tools,
> and for direct syscalls there is usually no way to track them.
> 
> On Android we heavily use a set of tools that use an extended version of
> the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> in userspace and slice their usage by process, shared (COW) vs.  unique
> mappings, backing, etc.  This can account for real physical memory usage
> even in cases like fork without exec (which Android uses heavily to share
> as many private COW pages as possible between processes), Kernel SamePage
> Merging, and clean zero pages.  It produces a measurement of the pages
> that only exist in that process (USS, for unique), and a measurement of
> the physical memory usage of that process with the cost of shared pages
> being evenly split between processes that share them (PSS).
> 
> If all anonymous memory is indistinguishable then figuring out the real
> physical memory usage (PSS) of each heap requires either a pagemap walking
> tool that can understand the heap debugging of every layer, or for every
> layer's heap debugging tools to implement the pagemap walking logic, in
> which case it is hard to get a consistent view of memory across the whole
> system.
> 
> Tracking the information in userspace leads to all sorts of problems.
> It either needs to be stored inside the process, which means every
> process has to have an API to export its current heap information upon
> request, or it has to be stored externally in a filesystem that
> somebody needs to clean up on crashes.  It needs to be readable while
> the process is still running, so it has to have some sort of
> synchronization with every layer of userspace.  Efficiently tracking
> the ranges requires reimplementing something like the kernel vma
> trees, and linking to it from every layer of userspace.  It requires
> more memory, more syscalls, more runtime cost, and more complexity to
> separately track regions that the kernel is already tracking.
> 
> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> userspace-provided name for anonymous vmas.  The names of named anonymous
> vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> 
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> Setting the name to NULL clears it.
> 
> The name is stored in a user pointer in the shared union in vm_area_struct
> that points to a null terminated string inside the user process.  vmas
> that point to the same address and are otherwise mergeable will be merged,
> but vmas that point to equivalent strings at different addresses will not
> be merged.
> 
> The idea to store a userspace pointer to reduce the complexity within mm
> (at the expense of the complexity of reading /proc/pid/mem) came from Dave
> Hansen.  This results in no runtime overhead in the mm subsystem other
> than comparing the anon_name pointers when considering vma merging.  The
> pointer is stored in a union with fields that are only used on file-backed
> mappings, so it does not increase memory usage.
> (Upstream changed to remove the union, so this patch adds it back as well)

This is, for me, the weirdest part of the series. Very little in the
kernel does stuff like this, and given the Fun(tm) we've had with things
like userfaultfd, I cannot begin to imagine what it'd be nice to have
this kind of control from a process over what is reading its maps file.
(i.e. stalling a maps reader by installing a userfaultfd for the
anon_name: that'd be a DoS for anything trying to read the maps file,
etc.)

Why is a kernel-copied string insufficient for this? I don't think VMA
merging is a fast-path operation, so doing a strcmp isn't going to wreck
anything...

Let me try to find the earlier thread with Dave Hansen... okay, found it:
https://lore.kernel.org/linux-mm/51DDF071.5000309@intel.com/

Right, so, this idea predates userfaultfd. :)

More notes below, but I *really* think this should not be a userspace
pointer. And since a separate union has been found, let's just do a
strndup_user() for the name, validate it as containing only printable
characters without \n \r \v \f and move the merging logic into a
separate patch.

> 
> Signed-off-by: Colin Cross <ccross@google.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Jan Glauber <jan.glauber@gmail.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Rob Landley <rob@landley.net>
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: Robin Holt <holt@sgi.com>
> Cc: Shaohua Li <shli@fusionio.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
> 
> ---
> 
> v2: updates the commit message to explain in more detail why the
>     patch is useful.
> v3: renames vma_get_anon_name to vma_anon_name
>     replaces logic in seq_print_vma_name with access_process_vm
>     removes Name: entry from smaps, it's already on the header line
>     changes the prctl option number to match what is currently in
>        use on Android
> v4: adds paragraph to commit log on why this is better than tracking
>        in userspace
>     squashes fixes from Andrew Morton to fix build error and warning
>     fix build error reported by Mark Salter when !CONFIG_MMU
> v5: rebased to v5.9-rc1, added minor fixes to match upstream
> v6: rebased to v5.9-rc3, and addressed review comments:
>        - added missing callers in fs/userfaultd.c
>        - simplified the union
>        - use the new access_remote_vm_locked() in show_map_vma()
>          since that already holds mmap_lock
> v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't
>       defined

(In the future, can you link to lore URLs for the earlier versions? It
makes it easier to do research on past threads...)

> ---
>  Documentation/filesystems/proc.rst |  2 ++
>  fs/proc/task_mmu.c                 | 24 ++++++++++++-
>  fs/userfaultfd.c                   |  7 ++--
>  include/linux/mm.h                 | 12 ++++++-
>  include/linux/mm_types.h           | 25 +++++++++++---
>  include/uapi/linux/prctl.h         |  3 ++
>  kernel/sys.c                       | 32 ++++++++++++++++++
>  mm/interval_tree.c                 |  2 +-
>  mm/madvise.c                       | 54 +++++++++++++++++++++++++++---
>  mm/mempolicy.c                     |  3 +-
>  mm/mlock.c                         |  2 +-
>  mm/mmap.c                          | 38 ++++++++++++---------
>  mm/mprotect.c                      |  2 +-
>  13 files changed, 173 insertions(+), 33 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 533c79e8d2cd..41a9cea73b8b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -429,6 +429,8 @@ is not associated with a file:
>   [stack]                    the stack of the main process
>   [vdso]                     the "virtual dynamic shared object",
>                              the kernel system call handler
> +[anon:<name>]               an anonymous mapping that has been

typo: needs a single leading space here.

> +                            named by userspace
>   =======                    ====================================
>  
>   or if empty, the mapping is anonymous.
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5066b0251ed8..290ce5ecad0d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -97,6 +97,21 @@ unsigned long task_statm(struct mm_struct *mm,
>  	return mm->total_vm;
>  }
>  
> +static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	char anon_name[NAME_MAX + 1];
> +	int n;
> +
> +	n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma), anon_name,
> +				    NAME_MAX, 0);
> +	if (n > 0) {
> +		seq_puts(m, "[anon:");
> +		seq_write(m, anon_name, strnlen(anon_name, n));
> +		seq_putc(m, ']');

seq_printf(m, "[anon:%s]", anon_name); ?

> +	}
> +}
> [...]
> +#ifdef CONFIG_ADVISE_SYSCALLS
> +int madvise_set_anon_name(unsigned long start, unsigned long len_in,
> +			  unsigned long name_addr);
> +#else
> +static inline int madvise_set_anon_name(unsigned long start, unsigned long len_in,
> +					unsigned long name_addr) {
> +	return 0;

Why not -EINVAL?

> +}
> +#endif
> [...]
> +static int madvise_vma_anon_name(struct vm_area_struct *vma,
> +				 struct vm_area_struct **prev,
> +				 unsigned long start, unsigned long end,
> +				 unsigned long name_addr)
> +{
> +	int error;
> +
> +	/* Only anonymous mappings can be named */
> +	if (vma->vm_file)
> +		return -EINVAL;

But a process could fork, and then change the contents of the vma_name
address in one process, showing different names for the same VMA in
maps? O_o That feels like an unexpected result -- the VMA should have
the same name in either process.

> [...]
>  static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  				struct file *file, unsigned long vm_flags,
> -				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> +				struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> +				const char __user *anon_name)

I think at the very least, this patch should be split in half: first add
the feature, and then add the ability to merge.

-- 
Kees Cook


  parent reply	other threads:[~2020-09-03 18:00 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-01 16:14 [PATCH v7 0/3] Anonymous VMA naming patches Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant Sumit Semwal
2020-09-01 16:14 ` [PATCH v7 3/3] mm: add a field to store names for private anonymous memory Sumit Semwal
2020-09-03 13:25   ` Kirill A. Shutemov
2020-09-03 13:43     ` Matthew Wilcox
2020-09-03 13:58       ` Kirill A. Shutemov
2020-09-03 15:59         ` Colin Cross
2020-09-03 17:31           ` Kees Cook
2020-09-03 17:54             ` Colin Cross
2020-09-03 18:00               ` Dave Hansen
2020-09-03 18:00   ` Kees Cook [this message]
2020-09-03 18:09     ` Dave Hansen
2020-09-03 18:26       ` Colin Cross
2020-09-03 18:40         ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=202009031031.D32EF57ED@keescook \
    --to=keescook@chromium.org \
    --cc=Nicolas.Viennot@twosigma.com \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=amit.pundir@linaro.org \
    --cc=areber@redhat.com \
    --cc=bvanassche@acm.org \
    --cc=ccross@google.com \
    --cc=chenqiwu@xiaomi.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=corbet@lwn.net \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dave.hansen@intel.com \
    --cc=ebiederm@xmission.com \
    --cc=gladkov.alexey@gmail.com \
    --cc=gorcunov@openvz.org \
    --cc=hannes@cmpxchg.org \
    --cc=holt@sgi.com \
    --cc=hughd@google.com \
    --cc=jan.glauber@gmail.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=john.stultz@linaro.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mchehab+huawei@kernel.org \
    --cc=mchristi@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=mingo@kernel.org \
    --cc=mkoutny@suse.com \
    --cc=oleg@redhat.com \
    --cc=penberg@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=rob@landley.net \
    --cc=sasha.levin@oracle.com \
    --cc=serge.hallyn@ubuntu.com \
    --cc=shli@fusionio.com \
    --cc=songliubraving@fb.com \
    --cc=sumit.semwal@linaro.org \
    --cc=tangchen@cn.fujitsu.com \
    --cc=tglx@linutronix.de \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=walken@google.com \
    --cc=willy@infradead.org \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).