From: Michal Hocko <mhocko@suse.com>
To: Kalesh Singh <kaleshsingh@google.com>
Cc: linux-doc@vger.kernel.org,
"Szabolcs Nagy" <szabolcs.nagy@arm.com>,
dri-devel@lists.freedesktop.org,
"Andrei Vagin" <avagin@gmail.com>, "Hui Su" <sh_def@163.com>,
"Michel Lespinasse" <walken@google.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Mauro Carvalho Chehab" <mchehab+huawei@kernel.org>,
"Daniel Jordan" <daniel.m.jordan@oracle.com>,
kernel-team@android.com, "Alexey Dobriyan" <adobriyan@gmail.com>,
linux-media@vger.kernel.org, "Kees Cook" <keescook@chromium.org>,
jannh@google.com, linaro-mm-sig@lists.linaro.org,
linux-fsdevel@vger.kernel.org,
"Bernd Edlinger" <bernd.edlinger@hotmail.de>,
surenb@google.com, "Alexey Gladkov" <gladkov.alexey@gmail.com>,
linux-kernel@vger.kernel.org, minchan@kernel.org,
"Yafang Shao" <laoar.shao@gmail.com>,
"Eric W. Biederman" <ebiederm@xmission.com>,
gregkh@linuxfoundation.org, hridya@google.com,
"Andrew Morton" <akpm@linux-foundation.org>,
linux-api@vger.kernel.org,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds
Date: Wed, 27 Jan 2021 10:05:26 +0100 [thread overview]
Message-ID: <20210127090526.GB827@dhcp22.suse.cz> (raw)
In-Reply-To: <20210126225138.1823266-1-kaleshsingh@google.com>
[Cc linux-api as this is a new user interface]
On Tue 26-01-21 22:51:28, Kalesh Singh wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handles to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.
>
> Currently, DMA buffer FDs can be accounted for using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are root-readable only -- as follows:
> 1. Do a readlink on each FD.
> 2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
> 3. stat the file to get the dmabuf inode number.
> 4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
>
> Android captures per-process system memory state when certain low memory
> events (e.g a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
>
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds. Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally, this approach is slow, as it requires
> many context switches to search for and retrieve the dma-buf info.
>
> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
>
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
>
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task holds. The same inode number can
> appear more than once; the number of occurrences gives the total number
> of FD references to the associated DMA buffer.
>
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
>
> The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)
>
> Performance Comparison:
> -----------------------
>
> The following data compares the time to capture the sizes of all DMA
> buffers referenced by FDs for all processes on an arm64 android device.
>
> ------------------------------------------------------
> |                   | Core 0 (Little) | Core 7 (Big) |
> ------------------------------------------------------
> | From <pid>/fdinfo | 318 ms          | 145 ms       |
> ------------------------------------------------------
> | Inodes from       | 114 ms          | 27 ms        |
> | dmabuf_fds;       | (2.8x ^)        | (5.4x ^)     |
> | data from sysfs   |                 |              |
> ------------------------------------------------------
>
> It can be inferred that in the worst case there is a 2.8x speedup for
> determining per-process DMA buffer FD sizes, when using the proposed
> interfaces.
>
> [1] https://lore.kernel.org/dri-devel/20210119225723.388883-1-hridya@google.com/
>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
> Documentation/filesystems/proc.rst | 30 ++++++
> drivers/dma-buf/dma-buf.c | 7 +-
> fs/proc/Makefile | 1 +
> fs/proc/base.c | 1 +
> fs/proc/dma_bufs.c | 159 +++++++++++++++++++++++++++++
> fs/proc/internal.h | 1 +
> include/linux/dma-buf.h | 5 +
> 7 files changed, 198 insertions(+), 6 deletions(-)
> create mode 100644 fs/proc/dma_bufs.c
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2fa69f710e2a..757dd47ab679 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
> 3.10 /proc/<pid>/timerslack_ns - Task timerslack value
> 3.11 /proc/<pid>/patch_state - Livepatch patch operation state
> 3.12 /proc/<pid>/arch_status - Task architecture specific information
> + 3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
>
> 4 Configuring procfs
> 4.1 Mount options
> @@ -2131,6 +2132,35 @@ AVX512_elapsed_ms
> the task is unlikely an AVX512 user, but depends on the workload and the
> scheduling scenario, it also could be a false negative mentioned above.
>
> +3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
> +-------------------------------------------------------------------------
> +This file exposes a list of the inode numbers for every DMA buffer
> +FD that the task has.
> +
> +The same inode number can appear more than once, indicating the total
> +FD references for the associated DMA buffer.
> +
> +The inode number can be used to lookup the DMA buffer information in
> +the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/.
> +
> +Example Output
> +~~~~~~~~~~~~~~
> +$ cat /proc/612/task/612/dmabuf_fds
> +30972 30973 45678 49326
> +
> +Permission to access this file is governed by a ptrace access mode
> +PTRACE_MODE_READ_FSCREDS.
> +
> +Threads can have different files when created without specifying
> +the CLONE_FILES flag. For this reason the interface is presented as
> +/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds.
> +This simplifies kernel code and aggregation can be handled in
> +userspace.
> +
> +If a thread has the same files as its group leader, then its dmabuf_fds
> +file will be empty as these dmabufs are already reported by the
> +group leader.
> +
> Chapter 4: Configuring procfs
> =============================
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 9ad6397aaa97..0660c06be4c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -29,8 +29,6 @@
> #include <uapi/linux/dma-buf.h>
> #include <uapi/linux/magic.h>
>
> -static inline int is_dma_buf_file(struct file *);
> -
> struct dma_buf_list {
> struct list_head head;
> struct mutex lock;
> @@ -434,10 +432,7 @@ static const struct file_operations dma_buf_fops = {
> .show_fdinfo = dma_buf_show_fdinfo,
> };
>
> -/*
> - * is_dma_buf_file - Check if struct file* is associated with dma_buf
> - */
> -static inline int is_dma_buf_file(struct file *file)
> +int is_dma_buf_file(struct file *file)
> {
> return file->f_op == &dma_buf_fops;
> }
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index bd08616ed8ba..91a67f43ddf4 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -16,6 +16,7 @@ proc-y += cmdline.o
> proc-y += consoles.o
> proc-y += cpuinfo.o
> proc-y += devices.o
> +proc-y += dma_bufs.o
> proc-y += interrupts.o
> proc-y += loadavg.o
> proc-y += meminfo.o
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b3422cda2a91..af15a60b9831 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3598,6 +3598,7 @@ static const struct pid_entry tid_base_stuff[] = {
> #ifdef CONFIG_SECCOMP_CACHE_DEBUG
> ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
> #endif
> + REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations),
> };
>
> static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
> diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c
> new file mode 100644
> index 000000000000..46ea9cf968ed
> --- /dev/null
> +++ b/fs/proc/dma_bufs.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Per-process DMA-BUF Stats
> + *
> + * Copyright (C) 2021 Google LLC.
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/fdtable.h>
> +#include <linux/ptrace.h>
> +#include <linux/seq_file.h>
> +
> +#include "internal.h"
> +
> +struct dmabuf_fds_private {
> + struct inode *inode;
> + struct task_struct *task;
> + struct file *dmabuf_file;
> +};
> +
> +static loff_t *next_dmabuf(struct dmabuf_fds_private *priv,
> + loff_t *pos)
> +{
> + struct fdtable *fdt;
> + struct file *file;
> +
> + rcu_read_lock();
> + fdt = files_fdtable(priv->task->files);
> + for (; *pos < fdt->max_fds; ++*pos) {
> + file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos);
> + if (file && is_dma_buf_file(file) && get_file_rcu(file)) {
> + priv->dmabuf_file = file;
> + break;
> + }
> + }
> + if (*pos >= fdt->max_fds)
> + pos = NULL;
> + rcu_read_unlock();
> +
> + return pos;
> +}
> +
> +static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos)
> +{
> + struct dmabuf_fds_private *priv = s->private;
> + struct files_struct *group_leader_files;
> +
> + priv->task = get_proc_task(priv->inode);
> +
> + if (!priv->task)
> + return ERR_PTR(-ESRCH);
> +
> + /* Hold task lock for duration that files need to be stable */
> + task_lock(priv->task);
> +
> + /*
> + * If this task is not the group leader but shares the same files, leave file empty.
> + * These dmabufs are already reported in the group leader's dmabuf_fds.
> + */
> + group_leader_files = priv->task->group_leader->files;
> + if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) {
> + task_unlock(priv->task);
> + put_task_struct(priv->task);
> + priv->task = NULL;
> + return NULL;
> + }
> +
> + return next_dmabuf(priv, pos);
> +}
> +
> +static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
> +{
> + ++*pos;
> + return next_dmabuf(s->private, pos);
> +}
> +
> +static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
> +{
> + struct dmabuf_fds_private *priv = s->private;
> +
> + if (priv->task) {
> + task_unlock(priv->task);
> + put_task_struct(priv->task);
> +
> + }
> + if (priv->dmabuf_file)
> + fput(priv->dmabuf_file);
> +}
> +
> +static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
> +{
> + struct dmabuf_fds_private *priv = s->private;
> + struct file *file = priv->dmabuf_file;
> + struct dma_buf *dmabuf = file->private_data;
> +
> + if (!dmabuf)
> + return -ESRCH;
> +
> + seq_printf(s, "%8lu ", file_inode(file)->i_ino);
> +
> + fput(priv->dmabuf_file);
> + priv->dmabuf_file = NULL;
> +
> + return 0;
> +}
> +
> +static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
> + .start = dmabuf_fds_seq_start,
> + .next = dmabuf_fds_seq_next,
> + .stop = dmabuf_fds_seq_stop,
> + .show = dmabuf_fds_seq_show
> +};
> +
> +static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
> + const struct seq_operations *ops)
> +{
> + struct dmabuf_fds_private *priv;
> + struct task_struct *task;
> + bool allowed = false;
> +
> + task = get_proc_task(inode);
> + if (!task)
> + return -ESRCH;
> +
> + allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> + put_task_struct(task);
> +
> + if (!allowed)
> + return -EACCES;
> +
> + priv = __seq_open_private(file, ops, sizeof(*priv));
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->inode = inode;
> + priv->task = NULL;
> + priv->dmabuf_file = NULL;
> +
> + return 0;
> +}
> +
> +static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
> +{
> + return seq_release_private(inode, file);
> +}
> +
> +static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
> +{
> + return proc_dmabuf_fds_open(inode, file,
> + &proc_tid_dmabuf_fds_seq_ops);
> +}
> +
> +const struct file_operations proc_tid_dmabuf_fds_operations = {
> + .open = tid_dmabuf_fds_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = proc_dmabuf_fds_release,
> +};
> +
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index f60b379dcdc7..4ca74220db9c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -303,6 +303,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> extern const struct file_operations proc_pid_smaps_rollup_operations;
> extern const struct file_operations proc_clear_refs_operations;
> extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_tid_dmabuf_fds_operations;
>
> extern unsigned long task_vsize(struct mm_struct *);
> extern unsigned long task_statm(struct mm_struct *,
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index cf72699cb2bc..087e11f7f193 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -27,6 +27,11 @@ struct device;
> struct dma_buf;
> struct dma_buf_attachment;
>
> +/**
> + * Check if struct file* is associated with dma_buf.
> + */
> +int is_dma_buf_file(struct file *file);
> +
> /**
> * struct dma_buf_ops - operations possible on struct dma_buf
> * @vmap: [optional] creates a virtual mapping for the buffer into kernel
> --
> 2.30.0.280.ga3ce27912f-goog
--
Michal Hocko
SUSE Labs