linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v2 dma-buf 0/3] Improve the dma-buf tracking
@ 2019-03-22  2:51 Chenbo Feng
  2019-03-22  2:51 ` [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode Chenbo Feng
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Chenbo Feng @ 2019-03-22  2:51 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-media
  Cc: kernel-team, Sumit Semwal, erickreyes, Daniel Vetter, Chenbo Feng

Currently, all dma-bufs share the same anonymous inode. While we can count
how many dma-buf fds or mappings a process has, we can't get the size of
the backing buffers or tell if two entries point to the same dma-buf. And
in debugfs, we can get a per-buffer breakdown of size and reference count,
but can't tell which processes are actually holding the references to each
buffer.

To resolve the issue above and provide better method for userspace to track
the dma-buf usage across different processes, the following changes are
proposed in dma-buf kernel side. First of all, replace the singleton inode
inside the dma-buf subsystem with a mini-filesystem, and assign each
dma-buf a unique inode out of this filesystem.  With this change, calling
stat(2) on each entry gives the caller a unique ID (st_ino), the buffer's
size (st_size), and even the number of pages assigned to each dma-buffer.
Secoundly, add the inode information to /sys/kernel/debug/dma_buf/bufinfo
so in the case where a buffer is mmap()ed into a process’s address space
but all remaining fds have been closed, we can still get the dma-buf
information and try to accociate it with the process by searching the
proc/pid/maps and looking for the corresponding inode number exposed in
dma-buf debug fs. Thirdly, created an ioctl to assign names to dma-bufs
which lets userspace assign short names (e.g., "CAMERA") to buffers. This
information can be extremely helpful for tracking and accounting shared
buffers based on their usage and original purpose. Last but not least, add
dma-buf information to /proc/pid/fdinfo by adding a show_fdinfo() handler
to dma_buf_file_operations. The handler will print the file_count and name
of each buffer.

Change in v2:
* Add a check to prevent changing dma-buf name when it is attached to
  devices.
* Fixed some compile warnings

Greg Hackmann (3):
  dma-buf: give each buffer a full-fledged inode
  dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls
  dma-buf: add show_fdinfo handler

 drivers/dma-buf/dma-buf.c    | 121 ++++++++++++++++++++++++++++++++---
 include/linux/dma-buf.h      |   5 +-
 include/uapi/linux/dma-buf.h |   4 ++
 include/uapi/linux/magic.h   |   1 +
 4 files changed, 122 insertions(+), 9 deletions(-)

-- 
2.21.0.rc2.261.ga7da99ff1b-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-22  2:51 [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Chenbo Feng
@ 2019-03-22  2:51 ` Chenbo Feng
  2019-03-22 15:02   ` Joel Fernandes
  2019-03-22  2:51 ` [RFC v2 2/3] dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls Chenbo Feng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Chenbo Feng @ 2019-03-22  2:51 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-media
  Cc: kernel-team, Sumit Semwal, erickreyes, Daniel Vetter,
	Greg Hackmann, Chenbo Feng

From: Greg Hackmann <ghackmann@google.com>

By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
can get a lot of fine-grained data about how shmem buffers are shared
among processes.  stat(2) on each entry gives the caller a unique
ID (st_ino), the buffer's size (st_size), and even the number of pages
currently charged to the buffer (st_blocks / 512).

In contrast, all dma-bufs share the same anonymous inode.  So while we
can count how many dma-buf fds or mappings a process has, we can't get
the size of the backing buffers or tell if two entries point to the same
dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
size and reference count, but can't tell which processes are actually
holding the references to each buffer.

Replace the singleton inode with full-fledged inodes allocated by
alloc_anon_inode().  This involves creating and mounting a
mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.

Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
---
 drivers/dma-buf/dma-buf.c  | 63 ++++++++++++++++++++++++++++++++++----
 include/uapi/linux/magic.h |  1 +
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7c858020d14b..ffd5a2ad7d6f 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -34,8 +34,10 @@
 #include <linux/poll.h>
 #include <linux/reservation.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 #include <uapi/linux/dma-buf.h>
+#include <uapi/linux/magic.h>
 
 static inline int is_dma_buf_file(struct file *);
 
@@ -46,6 +48,25 @@ struct dma_buf_list {
 
 static struct dma_buf_list db_list;
 
+static const struct dentry_operations dma_buf_dentry_ops = {
+	.d_dname = simple_dname,
+};
+
+static struct vfsmount *dma_buf_mnt;
+
+static struct dentry *dma_buf_fs_mount(struct file_system_type *fs_type,
+		int flags, const char *name, void *data)
+{
+	return mount_pseudo(fs_type, "dmabuf:", NULL, &dma_buf_dentry_ops,
+			DMA_BUF_MAGIC);
+}
+
+static struct file_system_type dma_buf_fs_type = {
+	.name = "dmabuf",
+	.mount = dma_buf_fs_mount,
+	.kill_sb = kill_anon_super,
+};
+
 static int dma_buf_release(struct inode *inode, struct file *file)
 {
 	struct dma_buf *dmabuf;
@@ -338,6 +359,31 @@ static inline int is_dma_buf_file(struct file *file)
 	return file->f_op == &dma_buf_fops;
 }
 
+static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
+{
+	struct file *file;
+	struct inode *inode = alloc_anon_inode(dma_buf_mnt->mnt_sb);
+
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	inode->i_size = dmabuf->size;
+	inode_set_bytes(inode, dmabuf->size);
+
+	file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
+				 flags, &dma_buf_fops);
+	if (IS_ERR(file))
+		goto err_alloc_file;
+	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
+	file->private_data = dmabuf;
+
+	return file;
+
+err_alloc_file:
+	iput(inode);
+	return file;
+}
+
 /**
  * DOC: dma buf device access
  *
@@ -433,8 +479,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	}
 	dmabuf->resv = resv;
 
-	file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf,
-					exp_info->flags);
+	file = dma_buf_getfile(dmabuf, exp_info->flags);
 	if (IS_ERR(file)) {
 		ret = PTR_ERR(file);
 		goto err_dmabuf;
@@ -1025,8 +1070,8 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 		return ret;
 
 	seq_puts(s, "\nDma-buf Objects:\n");
-	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\n",
-		   "size", "flags", "mode", "count");
+	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\t%-8s\n",
+		   "size", "flags", "mode", "count", "ino");
 
 	list_for_each_entry(buf_obj, &db_list.head, list_node) {
 		ret = mutex_lock_interruptible(&buf_obj->lock);
@@ -1037,11 +1082,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 			continue;
 		}
 
-		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\n",
+		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\n",
 				buf_obj->size,
 				buf_obj->file->f_flags, buf_obj->file->f_mode,
 				file_count(buf_obj->file),
-				buf_obj->exp_name);
+				buf_obj->exp_name,
+				file_inode(buf_obj->file)->i_ino);
 
 		robj = buf_obj->resv;
 		while (true) {
@@ -1136,6 +1182,10 @@ static inline void dma_buf_uninit_debugfs(void)
 
 static int __init dma_buf_init(void)
 {
+	dma_buf_mnt = kern_mount(&dma_buf_fs_type);
+	if (IS_ERR(dma_buf_mnt))
+		return PTR_ERR(dma_buf_mnt);
+
 	mutex_init(&db_list.lock);
 	INIT_LIST_HEAD(&db_list.head);
 	dma_buf_init_debugfs();
@@ -1146,5 +1196,6 @@ subsys_initcall(dma_buf_init);
 static void __exit dma_buf_deinit(void)
 {
 	dma_buf_uninit_debugfs();
+	kern_unmount(dma_buf_mnt);
 }
 __exitcall(dma_buf_deinit);
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f8c00045d537..665e18627f78 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -91,5 +91,6 @@
 #define UDF_SUPER_MAGIC		0x15013346
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
+#define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 
 #endif /* __LINUX_MAGIC_H__ */
-- 
2.21.0.225.g810b269d1ac-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC v2 2/3] dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls
  2019-03-22  2:51 [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Chenbo Feng
  2019-03-22  2:51 ` [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode Chenbo Feng
@ 2019-03-22  2:51 ` Chenbo Feng
  2019-03-22  2:51 ` [RFC v2 3/3] dma-buf: add show_fdinfo handler Chenbo Feng
  2019-03-22 12:29 ` [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Joel Fernandes
  3 siblings, 0 replies; 11+ messages in thread
From: Chenbo Feng @ 2019-03-22  2:51 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-media
  Cc: kernel-team, Sumit Semwal, erickreyes, Daniel Vetter,
	Greg Hackmann, Chenbo Feng

From: Greg Hackmann <ghackmann@google.com>

This patch adds complimentary DMA_BUF_SET_NAME and DMA_BUF_GET_NAME
ioctls, which lets userspace processes attach a free-form name to each
buffer.

This information can be extremely helpful for tracking and accounting
shared buffers.  For example, on Android, we know what each buffer will
be used for at allocation time: GL, multimedia, camera, etc.  The
userspace allocator can use DMA_BUF_SET_NAME to associate that
information with the buffer, so we can later give developers a
breakdown of how much memory they're allocating for graphics, camera,
etc.

Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
---
 drivers/dma-buf/dma-buf.c    | 48 ++++++++++++++++++++++++++++++++++--
 include/linux/dma-buf.h      |  5 +++-
 include/uapi/linux/dma-buf.h |  4 +++
 3 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ffd5a2ad7d6f..f5e8d4fab950 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -297,6 +297,42 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
 	return events;
 }
 
+static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
+{
+	char *name = strndup_user(buf, DMA_BUF_NAME_LEN);
+	long ret = 0;
+
+	if (IS_ERR(name))
+		return PTR_ERR(name);
+
+	mutex_lock(&dmabuf->lock);
+	if (!list_empty(&dmabuf->attachments)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	kfree(dmabuf->name);
+	dmabuf->name = name;
+
+out_unlock:
+	mutex_unlock(&dmabuf->lock);
+	return ret;
+}
+
+static long dma_buf_get_name(struct dma_buf *dmabuf, char __user *buf)
+{
+	const char *name = "";
+	long ret = 0;
+
+	mutex_lock(&dmabuf->lock);
+	if (dmabuf->name)
+		name = dmabuf->name;
+	if (copy_to_user(buf, name, strlen(name) + 1))
+		ret = -EFAULT;
+	mutex_unlock(&dmabuf->lock);
+
+	return ret;
+}
+
 static long dma_buf_ioctl(struct file *file,
 			  unsigned int cmd, unsigned long arg)
 {
@@ -335,6 +371,13 @@ static long dma_buf_ioctl(struct file *file,
 			ret = dma_buf_begin_cpu_access(dmabuf, direction);
 
 		return ret;
+
+	case DMA_BUF_SET_NAME:
+		return dma_buf_set_name(dmabuf, (const char __user *)arg);
+
+	case DMA_BUF_GET_NAME:
+		return dma_buf_get_name(dmabuf, (char __user *)arg);
+
 	default:
 		return -ENOTTY;
 	}
@@ -1082,12 +1125,13 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 			continue;
 		}
 
-		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\n",
+		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
 				buf_obj->size,
 				buf_obj->file->f_flags, buf_obj->file->f_mode,
 				file_count(buf_obj->file),
 				buf_obj->exp_name,
-				file_inode(buf_obj->file)->i_ino);
+				file_inode(buf_obj->file)->i_ino,
+				buf_obj->name ?: "");
 
 		robj = buf_obj->resv;
 		while (true) {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 58725f890b5b..582998e19df6 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -255,10 +255,12 @@ struct dma_buf_ops {
  * @file: file pointer used for sharing buffers across, and for refcounting.
  * @attachments: list of dma_buf_attachment that denotes all devices attached.
  * @ops: dma_buf_ops associated with this buffer object.
- * @lock: used internally to serialize list manipulation, attach/detach and vmap/unmap
+ * @lock: used internally to serialize list manipulation, attach/detach and
+ *        vmap/unmap, and accesses to name
  * @vmapping_counter: used internally to refcnt the vmaps
  * @vmap_ptr: the current vmap ptr if vmapping_counter > 0
  * @exp_name: name of the exporter; useful for debugging.
+ * @name: userspace-provided name; useful for accounting and debugging.
  * @owner: pointer to exporter module; used for refcounting when exporter is a
  *         kernel module.
  * @list_node: node for dma_buf accounting and debugging.
@@ -286,6 +288,7 @@ struct dma_buf {
 	unsigned vmapping_counter;
 	void *vmap_ptr;
 	const char *exp_name;
+	const char *name;
 	struct module *owner;
 	struct list_head list_node;
 	void *priv;
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
index d75df5210a4a..4e9c5fe7aecd 100644
--- a/include/uapi/linux/dma-buf.h
+++ b/include/uapi/linux/dma-buf.h
@@ -35,7 +35,11 @@ struct dma_buf_sync {
 #define DMA_BUF_SYNC_VALID_FLAGS_MASK \
 	(DMA_BUF_SYNC_RW | DMA_BUF_SYNC_END)
 
+#define DMA_BUF_NAME_LEN	32
+
 #define DMA_BUF_BASE		'b'
 #define DMA_BUF_IOCTL_SYNC	_IOW(DMA_BUF_BASE, 0, struct dma_buf_sync)
+#define DMA_BUF_SET_NAME	_IOW(DMA_BUF_BASE, 1, const char *)
+#define DMA_BUF_GET_NAME	_IOR(DMA_BUF_BASE, 2, char *)
 
 #endif
-- 
2.21.0.225.g810b269d1ac-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC v2 3/3] dma-buf: add show_fdinfo handler
  2019-03-22  2:51 [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Chenbo Feng
  2019-03-22  2:51 ` [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode Chenbo Feng
  2019-03-22  2:51 ` [RFC v2 2/3] dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls Chenbo Feng
@ 2019-03-22  2:51 ` Chenbo Feng
  2019-03-22 12:29 ` [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Joel Fernandes
  3 siblings, 0 replies; 11+ messages in thread
From: Chenbo Feng @ 2019-03-22  2:51 UTC (permalink / raw)
  To: linux-kernel, dri-devel, linux-media
  Cc: kernel-team, Sumit Semwal, erickreyes, Daniel Vetter,
	Greg Hackmann, Chenbo Feng

From: Greg Hackmann <ghackmann@google.com>

The show_fdinfo handler exports the same information available through
debugfs on a per-buffer basis.

Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Chenbo Feng <fengc@google.com>
---
 drivers/dma-buf/dma-buf.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f5e8d4fab950..fc7be2939ba1 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -383,6 +383,20 @@ static long dma_buf_ioctl(struct file *file,
 	}
 }
 
+static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
+{
+	struct dma_buf *dmabuf = file->private_data;
+
+	seq_printf(m, "size:\t%zu\n", dmabuf->size);
+	/* Don't count the temporary reference taken inside procfs seq_show */
+	seq_printf(m, "count:\t%ld\n", file_count(dmabuf->file) - 1);
+	seq_printf(m, "exp_name:\t%s\n", dmabuf->exp_name);
+	mutex_lock(&dmabuf->lock);
+	if (dmabuf->name)
+		seq_printf(m, "name:\t%s\n", dmabuf->name);
+	mutex_unlock(&dmabuf->lock);
+}
+
 static const struct file_operations dma_buf_fops = {
 	.release	= dma_buf_release,
 	.mmap		= dma_buf_mmap_internal,
@@ -392,6 +406,7 @@ static const struct file_operations dma_buf_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= dma_buf_ioctl,
 #endif
+	.show_fdinfo	= dma_buf_show_fdinfo,
 };
 
 /*
-- 
2.21.0.225.g810b269d1ac-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC v2 dma-buf 0/3] Improve the dma-buf tracking
  2019-03-22  2:51 [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Chenbo Feng
                   ` (2 preceding siblings ...)
  2019-03-22  2:51 ` [RFC v2 3/3] dma-buf: add show_fdinfo handler Chenbo Feng
@ 2019-03-22 12:29 ` Joel Fernandes
  3 siblings, 0 replies; 11+ messages in thread
From: Joel Fernandes @ 2019-03-22 12:29 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: linux-kernel, dri-devel, linux-media, kernel-team, Sumit Semwal,
	erickreyes, Daniel Vetter

On Thu, Mar 21, 2019 at 07:51:32PM -0700, Chenbo Feng wrote:
> Currently, all dma-bufs share the same anonymous inode. While we can count
> how many dma-buf fds or mappings a process has, we can't get the size of
> the backing buffers or tell if two entries point to the same dma-buf. And
> in debugfs, we can get a per-buffer breakdown of size and reference count,
> but can't tell which processes are actually holding the references to each
> buffer.
> 
> To resolve the issue above and provide better method for userspace to track
> the dma-buf usage across different processes, the following changes are
> proposed in dma-buf kernel side. First of all, replace the singleton inode
> inside the dma-buf subsystem with a mini-filesystem, and assign each
> dma-buf a unique inode out of this filesystem.  With this change, calling
> stat(2) on each entry gives the caller a unique ID (st_ino), the buffer's
> size (st_size), and even the number of pages assigned to each dma-buffer.
> Secoundly, add the inode information to /sys/kernel/debug/dma_buf/bufinfo
> so in the case where a buffer is mmap()ed into a process’s address space

Is there a better place to add bufinfo than debugfs, since debugfs is not
something Android may mount for non-debug uses in the future?

> but all remaining fds have been closed, we can still get the dma-buf
> information and try to accociate it with the process by searching the
> proc/pid/maps and looking for the corresponding inode number exposed in
> dma-buf debug fs. Thirdly, created an ioctl to assign names to dma-bufs
> which lets userspace assign short names (e.g., "CAMERA") to buffers. This
> information can be extremely helpful for tracking and accounting shared
> buffers based on their usage and original purpose. Last but not least, add
> dma-buf information to /proc/pid/fdinfo by adding a show_fdinfo() handler
> to dma_buf_file_operations. The handler will print the file_count and name
> of each buffer.

There are a couple more entries shown in fdinfo per patch 3/3 so it is worth
mentioning those here as well.

Also, is there kselftests to add or modify? I suggest we add those since they
are always invaluable.

I'll review more soon. Currently traveling. I am especially curious how you
created the new inodes since I recently had to do so for another usecase as
well. thanks!

 - Joel

> 
> Change in v2:
> * Add a check to prevent changing dma-buf name when it is attached to
>   devices.
> * Fixed some compile warnings
> 
> Greg Hackmann (3):
>   dma-buf: give each buffer a full-fledged inode
>   dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls
>   dma-buf: add show_fdinfo handler
> 
>  drivers/dma-buf/dma-buf.c    | 121 ++++++++++++++++++++++++++++++++---
>  include/linux/dma-buf.h      |   5 +-
>  include/uapi/linux/dma-buf.h |   4 ++
>  include/uapi/linux/magic.h   |   1 +
>  4 files changed, 122 insertions(+), 9 deletions(-)
> 
> -- 
> 2.21.0.rc2.261.ga7da99ff1b-goog
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-22  2:51 ` [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode Chenbo Feng
@ 2019-03-22 15:02   ` Joel Fernandes
  2019-03-24 17:56     ` Sandeep Patil
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Fernandes @ 2019-03-22 15:02 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: linux-kernel, dri-devel, linux-media, kernel-team, Sumit Semwal,
	erickreyes, Daniel Vetter, john.stultz

On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> From: Greg Hackmann <ghackmann@google.com>
> 
> By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> can get a lot of fine-grained data about how shmem buffers are shared
> among processes.  stat(2) on each entry gives the caller a unique
> ID (st_ino), the buffer's size (st_size), and even the number of pages
> currently charged to the buffer (st_blocks / 512).
> 
> In contrast, all dma-bufs share the same anonymous inode.  So while we
> can count how many dma-buf fds or mappings a process has, we can't get
> the size of the backing buffers or tell if two entries point to the same
> dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
> size and reference count, but can't tell which processes are actually
> holding the references to each buffer.
> 
> Replace the singleton inode with full-fledged inodes allocated by
> alloc_anon_inode().  This involves creating and mounting a
> mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> 
> Signed-off-by: Greg Hackmann <ghackmann@google.com>

I believe Greg's address needs to be updated on this patch otherwise the
emails would just bounce, no? I removed it from the CC list. You can still
keep the SOB I guess but remove it from the CC list when sending.

Also about the minifs, just playing devil's advocate for why this is needed.

Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
can just that not be used to get the size of the buffer? What is the benefit
of getting this from stat? The other way to get the size would be through
another IOCTL and that can be used to return other unique-ness related metadata
as well.  Neither of these need creation of a dedicated inode per dmabuf.
Imagine 1000s of dmabuf created in a system with rich graphics, then you have
inode allocated per dmabuf, sounds wasteful to me if not needed. You are also
adding the name per buffer, which can also be used to distinguish between
different buffers if that's the issue.

Also what is the benefit of having st_blocks from stat? AFAIK, that is the
same as the buffer's size which does not change for the lifetime of the
buffer. In your patch you're doing this when 'struct file' is created which
AIUI is what reflects in the st_blocks:
+	inode_set_bytes(inode, dmabuf->size);

Note that creating a new mount and allocating inode for each buf will consume
more kernel memory than before. Has there been a study of how much this is in
reality? Like by running a game? This is the point of using
anon_inode_getfile in the first place is to not have this wastage.

I am not against adding of inode per buffer, but I think we should have this
debate and make the right design choice here for what we really need.

thanks!

 - Joel


> Signed-off-by: Chenbo Feng <fengc@google.com>
> ---
>  drivers/dma-buf/dma-buf.c  | 63 ++++++++++++++++++++++++++++++++++----
>  include/uapi/linux/magic.h |  1 +
>  2 files changed, 58 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 7c858020d14b..ffd5a2ad7d6f 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -34,8 +34,10 @@
>  #include <linux/poll.h>
>  #include <linux/reservation.h>
>  #include <linux/mm.h>
> +#include <linux/mount.h>
>  
>  #include <uapi/linux/dma-buf.h>
> +#include <uapi/linux/magic.h>
>  
>  static inline int is_dma_buf_file(struct file *);
>  
> @@ -46,6 +48,25 @@ struct dma_buf_list {
>  
>  static struct dma_buf_list db_list;
>  
> +static const struct dentry_operations dma_buf_dentry_ops = {
> +	.d_dname = simple_dname,
> +};
> +
> +static struct vfsmount *dma_buf_mnt;
> +
> +static struct dentry *dma_buf_fs_mount(struct file_system_type *fs_type,
> +		int flags, const char *name, void *data)
> +{
> +	return mount_pseudo(fs_type, "dmabuf:", NULL, &dma_buf_dentry_ops,
> +			DMA_BUF_MAGIC);
> +}
> +
> +static struct file_system_type dma_buf_fs_type = {
> +	.name = "dmabuf",
> +	.mount = dma_buf_fs_mount,
> +	.kill_sb = kill_anon_super,
> +};
> +
>  static int dma_buf_release(struct inode *inode, struct file *file)
>  {
>  	struct dma_buf *dmabuf;
> @@ -338,6 +359,31 @@ static inline int is_dma_buf_file(struct file *file)
>  	return file->f_op == &dma_buf_fops;
>  }
>  
> +static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
> +{
> +	struct file *file;
> +	struct inode *inode = alloc_anon_inode(dma_buf_mnt->mnt_sb);
> +
> +	if (IS_ERR(inode))
> +		return ERR_CAST(inode);
> +
> +	inode->i_size = dmabuf->size;
> +	inode_set_bytes(inode, dmabuf->size);
> +
> +	file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> +				 flags, &dma_buf_fops);
> +	if (IS_ERR(file))
> +		goto err_alloc_file;
> +	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> +	file->private_data = dmabuf;
> +
> +	return file;
> +
> +err_alloc_file:
> +	iput(inode);
> +	return file;
> +}
> +
>  /**
>   * DOC: dma buf device access
>   *
> @@ -433,8 +479,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>  	}
>  	dmabuf->resv = resv;
>  
> -	file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf,
> -					exp_info->flags);
> +	file = dma_buf_getfile(dmabuf, exp_info->flags);
>  	if (IS_ERR(file)) {
>  		ret = PTR_ERR(file);
>  		goto err_dmabuf;
> @@ -1025,8 +1070,8 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  		return ret;
>  
>  	seq_puts(s, "\nDma-buf Objects:\n");
> -	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\n",
> -		   "size", "flags", "mode", "count");
> +	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\t%-8s\n",
> +		   "size", "flags", "mode", "count", "ino");
>  
>  	list_for_each_entry(buf_obj, &db_list.head, list_node) {
>  		ret = mutex_lock_interruptible(&buf_obj->lock);
> @@ -1037,11 +1082,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  			continue;
>  		}
>  
> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\n",
> +		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\n",
>  				buf_obj->size,
>  				buf_obj->file->f_flags, buf_obj->file->f_mode,
>  				file_count(buf_obj->file),
> -				buf_obj->exp_name);
> +				buf_obj->exp_name,
> +				file_inode(buf_obj->file)->i_ino);
>  
>  		robj = buf_obj->resv;
>  		while (true) {
> @@ -1136,6 +1182,10 @@ static inline void dma_buf_uninit_debugfs(void)
>  
>  static int __init dma_buf_init(void)
>  {
> +	dma_buf_mnt = kern_mount(&dma_buf_fs_type);
> +	if (IS_ERR(dma_buf_mnt))
> +		return PTR_ERR(dma_buf_mnt);
> +
>  	mutex_init(&db_list.lock);
>  	INIT_LIST_HEAD(&db_list.head);
>  	dma_buf_init_debugfs();
> @@ -1146,5 +1196,6 @@ subsys_initcall(dma_buf_init);
>  static void __exit dma_buf_deinit(void)
>  {
>  	dma_buf_uninit_debugfs();
> +	kern_unmount(dma_buf_mnt);
>  }
>  __exitcall(dma_buf_deinit);
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index f8c00045d537..665e18627f78 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -91,5 +91,6 @@
>  #define UDF_SUPER_MAGIC		0x15013346
>  #define BALLOON_KVM_MAGIC	0x13661366
>  #define ZSMALLOC_MAGIC		0x58295829
> +#define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> -- 
> 2.21.0.225.g810b269d1ac-goog
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-22 15:02   ` Joel Fernandes
@ 2019-03-24 17:56     ` Sandeep Patil
  2019-03-24 20:44       ` Joel Fernandes
  0 siblings, 1 reply; 11+ messages in thread
From: Sandeep Patil @ 2019-03-24 17:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Chenbo Feng, linux-kernel, dri-devel, linux-media, kernel-team,
	Sumit Semwal, erickreyes, Daniel Vetter, john.stultz

On Fri, Mar 22, 2019 at 11:02:55AM -0400, Joel Fernandes wrote:
> On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> > From: Greg Hackmann <ghackmann@google.com>
> > 
> > By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> > can get a lot of fine-grained data about how shmem buffers are shared
> > among processes.  stat(2) on each entry gives the caller a unique
> > ID (st_ino), the buffer's size (st_size), and even the number of pages
> > currently charged to the buffer (st_blocks / 512).
> > 
> > In contrast, all dma-bufs share the same anonymous inode.  So while we
> > can count how many dma-buf fds or mappings a process has, we can't get
> > the size of the backing buffers or tell if two entries point to the same
> > dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
> > size and reference count, but can't tell which processes are actually
> > holding the references to each buffer.
> > 
> > Replace the singleton inode with full-fledged inodes allocated by
> > alloc_anon_inode().  This involves creating and mounting a
> > mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> > 
> > Signed-off-by: Greg Hackmann <ghackmann@google.com>
> 
> I believe Greg's address needs to be updated on this patch otherwise the
> emails would just bounce, no? I removed it from the CC list. You can still
> keep the SOB I guess but remove it from the CC list when sending.
> 
> Also about the minifs, just playing devil's advocate for why this is needed.
> 
> Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
> can just that not be used to get the size of the buffer? What is the benefit
> of getting this from stat? The other way to get the size would be through
> another IOCTL and that can be used to return other unique-ness related metadata
> as well.  Neither of these need creation of a dedicated inode per dmabuf.

Can you give an example of "unique-ness related data" here? The inode seems
like the best fit cause its already unique, no?

> Imagine 1000s of dmabuf created in a system with rich graphics, then you have
> inode allocated per dmabuf, sounds wasteful to me if not needed.

It is, in fact, needed. The only way to differentiate a buffer owned by the
same process of the same size (which is very common) is via this unique
inode. Benefits of attaching the buffer metadata to the inode are that now
the metadata can be easily queried from userspace using stat(2) too.

> You are also
> adding the name per buffer, which can also be used to distinguish between
> different buffers if that's the issue.

The name is not unique, unless its decided by the kernel (which we don't want).
The names are coming in from user space in this case and are generally used
to track the "type / use case" of the buffer for debugging / triaging
purposes. We didn't want to rely on user space to give each buffer a unique
name, neither do we want to be responsible for generating one.

> 
> Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> same as the buffer's size which does not change for the lifetime of the
> buffer. In your patch you're doing this when 'struct file' is created which
> AIUI is what reflects in the st_blocks:
> +	inode_set_bytes(inode, dmabuf->size);

Can some of the use cases / data be trimmed down? I think so. For example, I
never understood what we do with map_files here (or why). It is perfectly
fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
the map_files bit is for consistency?

> 
> Note that creating a new mount and allocating inode for each buf will consume
> more kernel memory than before. Has there been a study of how much this is in
> reality? Like by running a game? This is the point of using
> anon_inode_getfile in the first place is to not have this wastage.

Its not "wastage" if its being used to track memeory IMHO. To give
some perspective, On Pixel 2 after fresh reboot + launch camera, we track
about 444 dmabuf objects with this that amounted to about 568573952
untracked memory. The cost of 444 'struct inode' compared to the 500+MB of
untrackable memory is miniscule.

May be, to make it generic, we make the tracking part optional somehow to
avoid the apparent wastage on other systems.

> 
> I am not against adding of inode per buffer, but I think we should have this
> debate and make the right design choice here for what we really need.

sure.

- ssp
> 
> thanks!
> 
>  - Joel
> 
> 
> > Signed-off-by: Chenbo Feng <fengc@google.com>
> > ---
> >  drivers/dma-buf/dma-buf.c  | 63 ++++++++++++++++++++++++++++++++++----
> >  include/uapi/linux/magic.h |  1 +
> >  2 files changed, 58 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 7c858020d14b..ffd5a2ad7d6f 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -34,8 +34,10 @@
> >  #include <linux/poll.h>
> >  #include <linux/reservation.h>
> >  #include <linux/mm.h>
> > +#include <linux/mount.h>
> >  
> >  #include <uapi/linux/dma-buf.h>
> > +#include <uapi/linux/magic.h>
> >  
> >  static inline int is_dma_buf_file(struct file *);
> >  
> > @@ -46,6 +48,25 @@ struct dma_buf_list {
> >  
> >  static struct dma_buf_list db_list;
> >  
> > +static const struct dentry_operations dma_buf_dentry_ops = {
> > +	.d_dname = simple_dname,
> > +};
> > +
> > +static struct vfsmount *dma_buf_mnt;
> > +
> > +static struct dentry *dma_buf_fs_mount(struct file_system_type *fs_type,
> > +		int flags, const char *name, void *data)
> > +{
> > +	return mount_pseudo(fs_type, "dmabuf:", NULL, &dma_buf_dentry_ops,
> > +			DMA_BUF_MAGIC);
> > +}
> > +
> > +static struct file_system_type dma_buf_fs_type = {
> > +	.name = "dmabuf",
> > +	.mount = dma_buf_fs_mount,
> > +	.kill_sb = kill_anon_super,
> > +};
> > +
> >  static int dma_buf_release(struct inode *inode, struct file *file)
> >  {
> >  	struct dma_buf *dmabuf;
> > @@ -338,6 +359,31 @@ static inline int is_dma_buf_file(struct file *file)
> >  	return file->f_op == &dma_buf_fops;
> >  }
> >  
> > +static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
> > +{
> > +	struct file *file;
> > +	struct inode *inode = alloc_anon_inode(dma_buf_mnt->mnt_sb);
> > +
> > +	if (IS_ERR(inode))
> > +		return ERR_CAST(inode);
> > +
> > +	inode->i_size = dmabuf->size;
> > +	inode_set_bytes(inode, dmabuf->size);
> > +
> > +	file = alloc_file_pseudo(inode, dma_buf_mnt, "dmabuf",
> > +				 flags, &dma_buf_fops);
> > +	if (IS_ERR(file))
> > +		goto err_alloc_file;
> > +	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> > +	file->private_data = dmabuf;
> > +
> > +	return file;
> > +
> > +err_alloc_file:
> > +	iput(inode);
> > +	return file;
> > +}
> > +
> >  /**
> >   * DOC: dma buf device access
> >   *
> > @@ -433,8 +479,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >  	}
> >  	dmabuf->resv = resv;
> >  
> > -	file = anon_inode_getfile("dmabuf", &dma_buf_fops, dmabuf,
> > -					exp_info->flags);
> > +	file = dma_buf_getfile(dmabuf, exp_info->flags);
> >  	if (IS_ERR(file)) {
> >  		ret = PTR_ERR(file);
> >  		goto err_dmabuf;
> > @@ -1025,8 +1070,8 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
> >  		return ret;
> >  
> >  	seq_puts(s, "\nDma-buf Objects:\n");
> > -	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\n",
> > -		   "size", "flags", "mode", "count");
> > +	seq_printf(s, "%-8s\t%-8s\t%-8s\t%-8s\texp_name\t%-8s\n",
> > +		   "size", "flags", "mode", "count", "ino");
> >  
> >  	list_for_each_entry(buf_obj, &db_list.head, list_node) {
> >  		ret = mutex_lock_interruptible(&buf_obj->lock);
> > @@ -1037,11 +1082,12 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
> >  			continue;
> >  		}
> >  
> > -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\n",
> > +		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\n",
> >  				buf_obj->size,
> >  				buf_obj->file->f_flags, buf_obj->file->f_mode,
> >  				file_count(buf_obj->file),
> > -				buf_obj->exp_name);
> > +				buf_obj->exp_name,
> > +				file_inode(buf_obj->file)->i_ino);
> >  
> >  		robj = buf_obj->resv;
> >  		while (true) {
> > @@ -1136,6 +1182,10 @@ static inline void dma_buf_uninit_debugfs(void)
> >  
> >  static int __init dma_buf_init(void)
> >  {
> > +	dma_buf_mnt = kern_mount(&dma_buf_fs_type);
> > +	if (IS_ERR(dma_buf_mnt))
> > +		return PTR_ERR(dma_buf_mnt);
> > +
> >  	mutex_init(&db_list.lock);
> >  	INIT_LIST_HEAD(&db_list.head);
> >  	dma_buf_init_debugfs();
> > @@ -1146,5 +1196,6 @@ subsys_initcall(dma_buf_init);
> >  static void __exit dma_buf_deinit(void)
> >  {
> >  	dma_buf_uninit_debugfs();
> > +	kern_unmount(dma_buf_mnt);
> >  }
> >  __exitcall(dma_buf_deinit);
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index f8c00045d537..665e18627f78 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -91,5 +91,6 @@
> >  #define UDF_SUPER_MAGIC		0x15013346
> >  #define BALLOON_KVM_MAGIC	0x13661366
> >  #define ZSMALLOC_MAGIC		0x58295829
> > +#define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> >  
> >  #endif /* __LINUX_MAGIC_H__ */
> > -- 
> > 2.21.0.225.g810b269d1ac-goog
> > 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-24 17:56     ` Sandeep Patil
@ 2019-03-24 20:44       ` Joel Fernandes
  2019-03-25 19:34         ` Chenbo Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Fernandes @ 2019-03-24 20:44 UTC (permalink / raw)
  To: Sandeep Patil
  Cc: Chenbo Feng, linux-kernel, dri-devel, linux-media, kernel-team,
	Sumit Semwal, erickreyes, Daniel Vetter, john.stultz

Hi Sandeep,

On Sun, Mar 24, 2019 at 10:56:33AM -0700, Sandeep Patil wrote:
> On Fri, Mar 22, 2019 at 11:02:55AM -0400, Joel Fernandes wrote:
> > On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> > > From: Greg Hackmann <ghackmann@google.com>
> > > 
> > > By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> > > can get a lot of fine-grained data about how shmem buffers are shared
> > > among processes.  stat(2) on each entry gives the caller a unique
> > > ID (st_ino), the buffer's size (st_size), and even the number of pages
> > > currently charged to the buffer (st_blocks / 512).
> > > 
> > > In contrast, all dma-bufs share the same anonymous inode.  So while we
> > > can count how many dma-buf fds or mappings a process has, we can't get
> > > the size of the backing buffers or tell if two entries point to the same
> > > dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
> > > size and reference count, but can't tell which processes are actually
> > > holding the references to each buffer.
> > > 
> > > Replace the singleton inode with full-fledged inodes allocated by
> > > alloc_anon_inode().  This involves creating and mounting a
> > > mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> > > 
> > > Signed-off-by: Greg Hackmann <ghackmann@google.com>
> > 
> > I believe Greg's address needs to be updated on this patch otherwise the
> > emails would just bounce, no? I removed it from the CC list. You can still
> > keep the SOB I guess but remove it from the CC list when sending.
> > 
> > Also about the minifs, just playing devil's advocate for why this is needed.
> > 
> > Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
> > can just that not be used to get the size of the buffer? What is the benefit
> > of getting this from stat? The other way to get the size would be through
> > another IOCTL and that can be used to return other unique-ness related metadata
> > as well.  Neither of these need creation of a dedicated inode per dmabuf.
> 
> Can you give an example of "unique-ness related data" here? The inode seems
> like the best fit cause its already unique, no?

I was thinking dma_buf file pointer, but I agree we need the per-inode now (see below).

> > Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> > same as the buffer's size which does not change for the lifetime of the
> > buffer. In your patch you're doing this when 'struct file' is created which
> > AIUI is what reflects in the st_blocks:
> > +	inode_set_bytes(inode, dmabuf->size);
> 
> Can some of the use cases / data be trimmed down? I think so. For example, I
> never understood what we do with map_files here (or why). It is perfectly
> fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
> the map_files bit is for consistency?

It just occured to me that since /proc/<pid/maps provides an inode number as
one of the fields, so indeed an inode per buf is a very good idea for the
user to distinguish buffers just by reading /proc/<pid/<maps> alone..

I also, similar to you, don't think map_files is useful for this usecase.
map_files are just symlinks named as memory ranges and pointing to a file. In
this case the symlink will point to default name "dmabuf" that doesn't exist.
If one does stat(2) on a map_file link, then it just returns the inode number
of the symlink, not what the map_file is pointing to, which seems kind of
useless.

Which makes me think both maps and map_files can be made more useful if we can
also make DMA_BUF_SET_NAME in the patch change the underlying dentry's name
from the default "dmabuf" to "dmabuf:<name>" ?

That would be useful because:
1. It should make /proc/pid/maps also have the name than always showing
"dmabuf".
2. It should make map_files also point to the name of the buffer than just
"dmabuf". Note that memfd_create(2) already takes a name and the maps_file
for this points to the name of the buffer created and showing it in both maps
and map_files.

I think this also removes the need for DMA_BUF_GET_NAME ioctl since the
dentry's name already has the information. I can try to look into that...
BTW any case we should not need GET_NAME ioctl since fdinfo already has the
name after SET_NAME is called. So let us drop that API?

> May be, to make it generic, we make the tracking part optional somehow to
> avoid the apparent wastage on other systems.

Yes, that's also fine. But I think if we can bake tracking into existing
mechanism and keep it always On, then that's also good for all other dmabuf
users as well and simplifies the kernel configuration for vendors.

> > I am not against adding of inode per buffer, but I think we should have this
> > debate and make the right design choice here for what we really need.
> 
> sure.

Right, so just to summarize:
- The intention here is /proc/<pid>/maps will give uniqueness (via the inode
  number) between different memory ranges. That I think is the main benefit
  of the patch.
- stat gives the size of buffer as does fdinfo
- fdinfo is useful to get the reference count of number of sharers of the
  buffer.
- map_files is not that useful for this usecase but can be made useful if
  we can name the underlying file's dentry to something other than "dmabuf".
- GET_NAME is not needed since fdinfo already has the SET_NAMEd name.

Do you agree?

Just to lay it out, there is a cost to unique inode. Each struct inode is 560
bytes on mainline with x86_64_defconfig. With 1000 buffers, we're looking at
~ 0.5MB of allocation. However I think I am convinced we need to do it
considering the advantages, and the size is trivial considering advantages.
Arguably large number dmabuf allocations are more likely to succeed with
devices with larger memory resources anyway :)

It is good to have this discussion. 

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-24 20:44       ` Joel Fernandes
@ 2019-03-25 19:34         ` Chenbo Feng
  2019-03-25 21:47           ` Joel Fernandes
  2019-03-25 21:48           ` Erick Reyes
  0 siblings, 2 replies; 11+ messages in thread
From: Chenbo Feng @ 2019-03-25 19:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Sandeep Patil, LKML, DRI mailing list, linux-media, kernel-team,
	Sumit Semwal, Erick Reyes, Daniel Vetter, John Stultz

On Sun, Mar 24, 2019 at 1:45 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Sandeep,
>
> On Sun, Mar 24, 2019 at 10:56:33AM -0700, Sandeep Patil wrote:
> > On Fri, Mar 22, 2019 at 11:02:55AM -0400, Joel Fernandes wrote:
> > > On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> > > > From: Greg Hackmann <ghackmann@google.com>
> > > >
> > > > By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> > > > can get a lot of fine-grained data about how shmem buffers are shared
> > > > among processes.  stat(2) on each entry gives the caller a unique
> > > > ID (st_ino), the buffer's size (st_size), and even the number of pages
> > > > currently charged to the buffer (st_blocks / 512).
> > > >
> > > > In contrast, all dma-bufs share the same anonymous inode.  So while we
> > > > can count how many dma-buf fds or mappings a process has, we can't get
> > > > the size of the backing buffers or tell if two entries point to the same
> > > > dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
> > > > size and reference count, but can't tell which processes are actually
> > > > holding the references to each buffer.
> > > >
> > > > Replace the singleton inode with full-fledged inodes allocated by
> > > > alloc_anon_inode().  This involves creating and mounting a
> > > > mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> > > >
> > > > Signed-off-by: Greg Hackmann <ghackmann@google.com>
> > >
> > > I believe Greg's address needs to be updated on this patch otherwise the
> > > emails would just bounce, no? I removed it from the CC list. You can still
> > > keep the SOB I guess but remove it from the CC list when sending.
> > >
> > > Also about the minifs, just playing devil's advocate for why this is needed.
> > >
> > > Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
> > > can just that not be used to get the size of the buffer? What is the benefit
> > > of getting this from stat? The other way to get the size would be through
> > > another IOCTL and that can be used to return other unique-ness related metadata
> > > as well.  Neither of these need creation of a dedicated inode per dmabuf.
> >
> > Can you give an example of "unique-ness related data" here? The inode seems
> > like the best fit cause its already unique, no?
>
> I was thinking dma_buf file pointer, but I agree we need the per-inode now (see below).
>
> > > Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> > > same as the buffer's size which does not change for the lifetime of the
> > > buffer. In your patch you're doing this when 'struct file' is created which
> > > AIUI is what reflects in the st_blocks:
> > > +   inode_set_bytes(inode, dmabuf->size);
> >
> > Can some of the use cases / data be trimmed down? I think so. For example, I
> > never understood what we do with map_files here (or why). It is perfectly
> > fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
> > the map_files bit is for consistency?
>
> It just occured to me that since /proc/<pid/maps provides an inode number as
> one of the fields, so indeed an inode per buf is a very good idea for the
> user to distinguish buffers just by reading /proc/<pid/<maps> alone..
>
> I also, similar to you, don't think map_files is useful for this usecase.
> map_files are just symlinks named as memory ranges and pointing to a file. In
> this case the symlink will point to default name "dmabuf" that doesn't exist.
> If one does stat(2) on a map_file link, then it just returns the inode number
> of the symlink, not what the map_file is pointing to, which seems kind of
> useless.
>
I might be wrong but I don't think we did anything special for the
map_files in this patch. I think the confusion might be from commit
message where Greg mentioned the map_files to describe the behavior of
shmem buffer when comparing it with dma-buf. The file system
implementation and the file allocation action in this patch are just
some minimal effort to associate each dma_buf object with an inode and
properly populate the size information in the file object. And we
didn't use proc/pid/map_files at all in the android implementation
indeed.
>
> Which makes me think both maps and map_files can be made more useful if we can
> also make DMA_BUF_SET_NAME in the patch change the underlying dentry's name
> from the default "dmabuf" to "dmabuf:<name>" ?
>
> That would be useful because:
> 1. It should make /proc/pid/maps also have the name than always showing
> "dmabuf".
> 2. It should make map_files also point to the name of the buffer than just
> "dmabuf". Note that memfd_create(2) already takes a name and the maps_file
> for this points to the name of the buffer created and showing it in both maps
> and map_files.
>
> I think this also removes the need for DMA_BUF_GET_NAME ioctl since the
> dentry's name already has the information. I can try to look into that...
> BTW any case we should not need GET_NAME ioctl since fdinfo already has the
> name after SET_NAME is called. So let us drop that API?
>
> > May be, to make it generic, we make the tracking part optional somehow to
> > avoid the apparent wastage on other systems.
>
> Yes, that's also fine. But I think if we can bake tracking into existing
> mechanism and keep it always On, then that's also good for all other dmabuf
> users as well and simplifies the kernel configuration for vendors.
>
> > > I am not against adding of inode per buffer, but I think we should have this
> > > debate and make the right design choice here for what we really need.
> >
> > sure.
>
> Right, so just to summarize:
> - The intention here is /proc/<pid>/maps will give uniqueness (via the inode
>   number) between different memory ranges. That I think is the main benefit
>   of the patch.
> - stat gives the size of buffer as does fdinfo
> - fdinfo is useful to get the reference count of number of sharers of the
>   buffer.
> - map_files is not that useful for this usecase but can be made useful if
>   we can name the underlying file's dentry to something other than "dmabuf".
> - GET_NAME is not needed since fdinfo already has the SET_NAMEd name.
>
> Do you agree?
>
Thanks for summarize it, I will look into the GET_NAME/SET_NAME ioctl
to make it more useful as you suggested above. Also, I will try to add
some test to verify the behavior.
>
> Just to lay it out, there is a cost to unique inode. Each struct inode is 560
> bytes on mainline with x86_64_defconfig. With 1000 buffers, we're looking at
> ~ 0.5MB of allocation. However I think I am convinced we need to do it
> considering the advantages, and the size is trivial considering advantages.
> Arguably large number dmabuf allocations are more likely to succeed with
> devices with larger memory resources anyway :)
>
> It is good to have this discussion.
>
> thanks,
>
>  - Joel
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-25 19:34         ` Chenbo Feng
@ 2019-03-25 21:47           ` Joel Fernandes
  2019-03-25 21:48           ` Erick Reyes
  1 sibling, 0 replies; 11+ messages in thread
From: Joel Fernandes @ 2019-03-25 21:47 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: Sandeep Patil, LKML, DRI mailing list, linux-media, kernel-team,
	Sumit Semwal, Erick Reyes, Daniel Vetter, John Stultz

On Mon, Mar 25, 2019 at 12:34:59PM -0700, Chenbo Feng wrote:
[snip]
> > > > Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> > > > same as the buffer's size which does not change for the lifetime of the
> > > > buffer. In your patch you're doing this when 'struct file' is created which
> > > > AIUI is what reflects in the st_blocks:
> > > > +   inode_set_bytes(inode, dmabuf->size);
> > >
> > > Can some of the use cases / data be trimmed down? I think so. For example, I
> > > never understood what we do with map_files here (or why). It is perfectly
> > > fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
> > > the map_files bit is for consistency?
> >
> > It just occured to me that since /proc/<pid/maps provides an inode number as
> > one of the fields, so indeed an inode per buf is a very good idea for the
> > user to distinguish buffers just by reading /proc/<pid/<maps> alone..
> >
> > I also, similar to you, don't think map_files is useful for this usecase.
> > map_files are just symlinks named as memory ranges and pointing to a file. In
> > this case the symlink will point to default name "dmabuf" that doesn't exist.
> > If one does stat(2) on a map_file link, then it just returns the inode number
> > of the symlink, not what the map_file is pointing to, which seems kind of
> > useless.
> >
> I might be wrong but I don't think we did anything special for the
> map_files in this patch. I think the confusion might be from commit
> message where Greg mentioned the map_files to describe the behavior of
> shmem buffer when comparing it with dma-buf. The file system
> implementation and the file allocation action in this patch are just
> some minimal effort to associate each dma_buf object with an inode and
> properly populate the size information in the file object. And we
> didn't use proc/pid/map_files at all in the android implementation
> indeed.

You are right.

> > > > I am not against adding of inode per buffer, but I think we should have this
> > > > debate and make the right design choice here for what we really need.
> > >
> > > sure.
> >
> > Right, so just to summarize:
> > - The intention here is /proc/<pid>/maps will give uniqueness (via the inode
> >   number) between different memory ranges. That I think is the main benefit
> >   of the patch.
> > - stat gives the size of buffer as does fdinfo
> > - fdinfo is useful to get the reference count of number of sharers of the
> >   buffer.
> > - map_files is not that useful for this usecase but can be made useful if
> >   we can name the underlying file's dentry to something other than "dmabuf".
> > - GET_NAME is not needed since fdinfo already has the SET_NAMEd name.
> >
> > Do you agree?
> >
> Thanks for summarize it, I will look into the GET_NAME/SET_NAME ioctl
> to make it more useful as you suggested above. Also, I will try to add
> some test to verify the behavior.

Sounds great, thanks!

 - Joel


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode
  2019-03-25 19:34         ` Chenbo Feng
  2019-03-25 21:47           ` Joel Fernandes
@ 2019-03-25 21:48           ` Erick Reyes
  1 sibling, 0 replies; 11+ messages in thread
From: Erick Reyes @ 2019-03-25 21:48 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: Joel Fernandes, Sandeep Patil, LKML, DRI mailing list,
	linux-media, kernel-team, Sumit Semwal, Daniel Vetter,
	John Stultz

In the original userspace implementation Greg wrote, he was iterating
the directory entries in proc/<pid>/map_files, doing readlink() on
each to find out whether the entry was a dmabuf. This turned out to be
very slow so we reworked it to parse proc/<pid>/maps instead.


On Mon, Mar 25, 2019 at 12:35 PM Chenbo Feng <fengc@google.com> wrote:
>
> On Sun, Mar 24, 2019 at 1:45 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Sandeep,
> >
> > On Sun, Mar 24, 2019 at 10:56:33AM -0700, Sandeep Patil wrote:
> > > On Fri, Mar 22, 2019 at 11:02:55AM -0400, Joel Fernandes wrote:
> > > > On Thu, Mar 21, 2019 at 07:51:33PM -0700, Chenbo Feng wrote:
> > > > > From: Greg Hackmann <ghackmann@google.com>
> > > > >
> > > > > By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN
> > > > > can get a lot of fine-grained data about how shmem buffers are shared
> > > > > among processes.  stat(2) on each entry gives the caller a unique
> > > > > ID (st_ino), the buffer's size (st_size), and even the number of pages
> > > > > currently charged to the buffer (st_blocks / 512).
> > > > >
> > > > > In contrast, all dma-bufs share the same anonymous inode.  So while we
> > > > > can count how many dma-buf fds or mappings a process has, we can't get
> > > > > the size of the backing buffers or tell if two entries point to the same
> > > > > dma-buf.  On systems with debugfs, we can get a per-buffer breakdown of
> > > > > size and reference count, but can't tell which processes are actually
> > > > > holding the references to each buffer.
> > > > >
> > > > > Replace the singleton inode with full-fledged inodes allocated by
> > > > > alloc_anon_inode().  This involves creating and mounting a
> > > > > mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
> > > > >
> > > > > Signed-off-by: Greg Hackmann <ghackmann@google.com>
> > > >
> > > > I believe Greg's address needs to be updated on this patch otherwise the
> > > > emails would just bounce, no? I removed it from the CC list. You can still
> > > > keep the SOB I guess but remove it from the CC list when sending.
> > > >
> > > > Also about the minifs, just playing devil's advocate for why this is needed.
> > > >
> > > > Since you are already adding the size information to /proc/pid/fdinfo/<fd> ,
> > > > can just that not be used to get the size of the buffer? What is the benefit
> > > > of getting this from stat? The other way to get the size would be through
> > > > another IOCTL and that can be used to return other unique-ness related metadata
> > > > as well.  Neither of these need creation of a dedicated inode per dmabuf.
> > >
> > > Can you give an example of "unique-ness related data" here? The inode seems
> > > like the best fit cause its already unique, no?
> >
> > I was thinking dma_buf file pointer, but I agree we need the per-inode now (see below).
> >
> > > > Also what is the benefit of having st_blocks from stat? AFAIK, that is the
> > > > same as the buffer's size which does not change for the lifetime of the
> > > > buffer. In your patch you're doing this when 'struct file' is created which
> > > > AIUI is what reflects in the st_blocks:
> > > > +   inode_set_bytes(inode, dmabuf->size);
> > >
> > > Can some of the use cases / data be trimmed down? I think so. For example, I
> > > never understood what we do with map_files here (or why). It is perfectly
> > > fine to just get the data from /proc/<pid>/fd and /proc/<pid>/maps. I guess
> > > the map_files bit is for consistency?
> >
> > It just occured to me that since /proc/<pid/maps provides an inode number as
> > one of the fields, so indeed an inode per buf is a very good idea for the
> > user to distinguish buffers just by reading /proc/<pid/<maps> alone..
> >
> > I also, similar to you, don't think map_files is useful for this usecase.
> > map_files are just symlinks named as memory ranges and pointing to a file. In
> > this case the symlink will point to default name "dmabuf" that doesn't exist.
> > If one does stat(2) on a map_file link, then it just returns the inode number
> > of the symlink, not what the map_file is pointing to, which seems kind of
> > useless.
> >
> I might be wrong but I don't think we did anything special for the
> map_files in this patch. I think the confusion might be from commit
> message where Greg mentioned the map_files to describe the behavior of
> shmem buffer when comparing it with dma-buf. The file system
> implementation and the file allocation action in this patch are just
> some minimal effort to associate each dma_buf object with an inode and
> properly populate the size information in the file object. And we
> didn't use proc/pid/map_files at all in the android implementation
> indeed.
> >
> > Which makes me think both maps and map_files can be made more useful if we can
> > also make DMA_BUF_SET_NAME in the patch change the underlying dentry's name
> > from the default "dmabuf" to "dmabuf:<name>" ?
> >
> > That would be useful because:
> > 1. It should make /proc/pid/maps also have the name than always showing
> > "dmabuf".
> > 2. It should make map_files also point to the name of the buffer than just
> > "dmabuf". Note that memfd_create(2) already takes a name and the maps_file
> > for this points to the name of the buffer created and showing it in both maps
> > and map_files.
> >
> > I think this also removes the need for DMA_BUF_GET_NAME ioctl since the
> > dentry's name already has the information. I can try to look into that...
> > BTW any case we should not need GET_NAME ioctl since fdinfo already has the
> > name after SET_NAME is called. So let us drop that API?
> >
> > > May be, to make it generic, we make the tracking part optional somehow to
> > > avoid the apparent wastage on other systems.
> >
> > Yes, that's also fine. But I think if we can bake tracking into existing
> > mechanism and keep it always On, then that's also good for all other dmabuf
> > users as well and simplifies the kernel configuration for vendors.
> >
> > > > I am not against adding of inode per buffer, but I think we should have this
> > > > debate and make the right design choice here for what we really need.
> > >
> > > sure.
> >
> > Right, so just to summarize:
> > - The intention here is /proc/<pid>/maps will give uniqueness (via the inode
> >   number) between different memory ranges. That I think is the main benefit
> >   of the patch.
> > - stat gives the size of buffer as does fdinfo
> > - fdinfo is useful to get the reference count of number of sharers of the
> >   buffer.
> > - map_files is not that useful for this usecase but can be made useful if
> >   we can name the underlying file's dentry to something other than "dmabuf".
> > - GET_NAME is not needed since fdinfo already has the SET_NAMEd name.
> >
> > Do you agree?
> >
> Thanks for summarize it, I will look into the GET_NAME/SET_NAME ioctl
> to make it more useful as you suggested above. Also, I will try to add
> some test to verify the behavior.
> >
> > Just to lay it out, there is a cost to unique inode. Each struct inode is 560
> > bytes on mainline with x86_64_defconfig. With 1000 buffers, we're looking at
> > ~ 0.5MB of allocation. However I think I am convinced we need to do it
> > considering the advantages, and the size is trivial considering advantages.
> > Arguably large number dmabuf allocations are more likely to succeed with
> > devices with larger memory resources anyway :)
> >
> > It is good to have this discussion.
> >
> > thanks,
> >
> >  - Joel
> >
> > --
> > You received this message because you are subscribed to the Google Groups "kernel-team" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-03-25 21:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-22  2:51 [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Chenbo Feng
2019-03-22  2:51 ` [RFC v2 1/3] dma-buf: give each buffer a full-fledged inode Chenbo Feng
2019-03-22 15:02   ` Joel Fernandes
2019-03-24 17:56     ` Sandeep Patil
2019-03-24 20:44       ` Joel Fernandes
2019-03-25 19:34         ` Chenbo Feng
2019-03-25 21:47           ` Joel Fernandes
2019-03-25 21:48           ` Erick Reyes
2019-03-22  2:51 ` [RFC v2 2/3] dma-buf: add DMA_BUF_{GET,SET}_NAME ioctls Chenbo Feng
2019-03-22  2:51 ` [RFC v2 3/3] dma-buf: add show_fdinfo handler Chenbo Feng
2019-03-22 12:29 ` [RFC v2 dma-buf 0/3] Improve the dma-buf tracking Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).