linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] binder: implement binderfs
@ 2018-12-04 13:12 Christian Brauner
  2018-12-05 20:01 ` Greg KH
  0 siblings, 1 reply; 5+ messages in thread
From: Christian Brauner @ 2018-12-04 13:12 UTC (permalink / raw)
  To: gregkh, tkjos, maco, linux-kernel
  Cc: arve, joel, darrick.wong, david, kilobyte, devel, chouryzhou,
	Christian Brauner, Todd Kjos

As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
implementation of binderfs. If you want to skip reading and just see how it
works, please go to [2].

binderfs is a backwards-compatible filesystem for Android's binder ipc
mechanism. Each ipc namespace will mount a new binderfs instance. Mounting
binderfs multiple times at different locations in the same ipc namespace
will not cause a new super block to be allocated and hence it will be the
same filesystem instance.
Each new binderfs mount will have its own set of binder devices only
visible in the ipc namespace it has been mounted in. All devices in a new
binderfs mount will follow the scheme binder%d and numbering will always
start at 0.

/* Backwards compatibility */
Devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES for the
initial ipc namespace will work as before. They will be registered via
misc_register() and appear in the devtmpfs mount. Specifically, the
standard devices binder, hwbinder, and vndbinder will all appear in their
standard locations in /dev. Mounting or unmounting the binderfs mount in
the initial ipc namespace will have no effect on these devices, i.e. they
will neither show up in the binderfs mount nor will they disappear when the
binderfs mount is gone.

/* binder-control */
Each new binderfs instance comes with a binder-control device. No other
devices will be present at first. The binder-control device can be used to
dynamically allocate binder devices. All requests operate on the binderfs
mount the binder-control device resides in:
- BINDER_CTL_ADD
  Allocate a new binder device.
Assuming a new instance of binderfs has been mounted at /dev/binderfs via
mount -t binderfs binderfs /dev/binderfs. Then a request to create a new
binder device can be made via:

struct binderfs_device device = {0};
int fd = open("/dev/binderfs/binder-control", O_RDWR);
ioctl(fd, BINDER_CTL_ADD, &device);

The struct binderfs_device will be used to return the major and minor
number, as well as the index used as the new name for the device.
Binderfs devices can simply be removed via unlink().

/* Implementation details */
- When binderfs is registered as a new filesystem it will dynamically
  allocate a new major number. The allocated major number will be returned
  in struct binderfs_device when a new binder device is allocated.
  Minor numbers that have been given out are tracked in a global idr struct
  that is capped at BINDERFS_MAX_MINOR. The minor number tracker is
  protected by a global mutex. This is the only point of contention between
  binderfs mounts.
- The naming scheme for binder devices is binder%d. Each binderfs mount
  starts numbering of new binder devices at 0 up to n. The indeces used in
  constructing the name are tracked in a struct idr that is per-binderfs
  super block.
- Each binderfs super block has its own struct binderfs_info that tracks
  specific details about a binderfs instance: the ipc namespace, the idr
  tracker for new names, the dentry of the binder-control device, the
  root uid and gid of the user namespace the binderfs instance was mounted
  in, and a mutex to protect the binder-control device and idr tracker from
  concurrent access.
- binderfs can be mounted by user namespace root in a non-initial user
  namespace. The devices will be owned by user namespace root.
- New binder devices associated with a binderfs mount do not use the
  full misc_register() infrastructure. The misc_register() infrastructure
  can only create new devices in the host's devtmpfs mount. binderfs does
  however only make devices appear under its own mountpoint and thus
  allocates new character devices nodes from the inode of the root dentry
  of the super block. This will have the side-effect that binderfs specific
  device nodes do not appear in sysfs. This behavior is similar to devpts
  allocated pts devices and has no effect on the functionality of the ipc
  mechanism itself.

/* Create a new binder device in a binderfs mount */
sudo mkdir /dev/binderfs
sudo mount -t binder binder /dev/binderfs

 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <unistd.h>
 #include <linux/android/binder_ctl.h>

 int main(int argc, char *argv[])
 {
         int fd, ret, saved_errno;
         struct binderfs_device device = { 0 };

         fd = open("/dev/binderfs/binder-control", O_RDONLY | O_CLOEXEC);
         if (fd < 0) {
                 printf("%s - Failed to open binder-control device\n",
                        strerror(errno));
                 exit(EXIT_FAILURE);
         }

         ret = ioctl(fd, BINDER_CTL_ADD, &device);
         saved_errno = errno;
         close(fd);
         errno = saved_errno;
         if (ret < 0) {
                 printf("%s - Failed to allocate new binder device\n",
                        strerror(errno));
                 exit(EXIT_FAILURE);
         }

         printf("Allocated new binder device with major %d, minor %d, and "
                "name suffix %d\n", device.major, device.minor,
                device.suffix);

         exit(EXIT_SUCCESS);
 }

/* Demo */
A demo of how binderfs works can be found under [2].

[1]: https://goo.gl/JL2tfX
[2]: https://asciinema.org/a/zYUCqL7OySASWK9S2yVFq2sxM

Cc: Martijn Coenen <maco@android.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 drivers/android/Kconfig                 |  12 +
 drivers/android/Makefile                |   1 +
 drivers/android/binder.c                |  25 +-
 drivers/android/binder_internal.h       |  52 +++
 drivers/android/binderfs.c              | 576 ++++++++++++++++++++++++
 include/uapi/linux/android/binder_ctl.h |  33 ++
 include/uapi/linux/magic.h              |   1 +
 7 files changed, 683 insertions(+), 17 deletions(-)
 create mode 100644 drivers/android/binder_internal.h
 create mode 100644 drivers/android/binderfs.c
 create mode 100644 include/uapi/linux/android/binder_ctl.h

diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig
index 51e8250d113f..4c190f8d1f4c 100644
--- a/drivers/android/Kconfig
+++ b/drivers/android/Kconfig
@@ -20,6 +20,18 @@ config ANDROID_BINDER_IPC
 	  Android process, using Binder to identify, invoke and pass arguments
 	  between said processes.
 
+config ANDROID_BINDERFS
+	bool "Android Binderfs filesystem"
+	depends on ANDROID_BINDER_IPC
+	default n
+	---help---
+	  Binderfs is a pseudo-filesystem for the Android Binder IPC driver
+	  which can be mounted per-ipc namespace allowing to run multiple
+	  instances of Android.
+	  Each binderfs mount initially only contains a binder-control device.
+	  It can be used to dynamically allocate new binder IPC devices via
+	  ioctls.
+
 config ANDROID_BINDER_DEVICES
 	string "Android Binder devices"
 	depends on ANDROID_BINDER_IPC
diff --git a/drivers/android/Makefile b/drivers/android/Makefile
index a01254c43ee3..c7856e3200da 100644
--- a/drivers/android/Makefile
+++ b/drivers/android/Makefile
@@ -1,4 +1,5 @@
 ccflags-y += -I$(src)			# needed for trace events
 
+obj-$(CONFIG_ANDROID_BINDERFS)		+= binderfs.o
 obj-$(CONFIG_ANDROID_BINDER_IPC)	+= binder.o binder_alloc.o
 obj-$(CONFIG_ANDROID_BINDER_IPC_SELFTEST) += binder_alloc_selftest.o
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index cb30a524d16d..3ed8bc4b7451 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -78,6 +78,7 @@
 #include <asm/cacheflush.h>
 
 #include "binder_alloc.h"
+#include "binder_internal.h"
 #include "binder_trace.h"
 
 static HLIST_HEAD(binder_deferred_list);
@@ -262,20 +263,6 @@ static struct binder_transaction_log_entry *binder_transaction_log_add(
 	return e;
 }
 
-struct binder_context {
-	struct binder_node *binder_context_mgr_node;
-	struct mutex context_mgr_node_lock;
-
-	kuid_t binder_context_mgr_uid;
-	const char *name;
-};
-
-struct binder_device {
-	struct hlist_node hlist;
-	struct miscdevice miscdev;
-	struct binder_context context;
-};
-
 /**
  * struct binder_work - work enqueued on a worklist
  * @entry:             node enqueued on list
@@ -4935,8 +4922,12 @@ static int binder_open(struct inode *nodp, struct file *filp)
 	proc->tsk = current->group_leader;
 	INIT_LIST_HEAD(&proc->todo);
 	proc->default_priority = task_nice(current);
-	binder_dev = container_of(filp->private_data, struct binder_device,
-				  miscdev);
+	/* binderfs stashes devices in i_private */
+	if (is_binderfs_device(nodp))
+		binder_dev = nodp->i_private;
+	else
+		binder_dev = container_of(filp->private_data,
+					  struct binder_device, miscdev);
 	proc->context = &binder_dev->context;
 	binder_alloc_init(&proc->alloc);
 
@@ -5724,7 +5715,7 @@ static int binder_transaction_log_show(struct seq_file *m, void *unused)
 	return 0;
 }
 
-static const struct file_operations binder_fops = {
+const struct file_operations binder_fops = {
 	.owner = THIS_MODULE,
 	.poll = binder_poll,
 	.unlocked_ioctl = binder_ioctl,
diff --git a/drivers/android/binder_internal.h b/drivers/android/binder_internal.h
new file mode 100644
index 000000000000..6879478196aa
--- /dev/null
+++ b/drivers/android/binder_internal.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_BINDER_INTERNAL_H
+#define _LINUX_BINDER_INTERNAL_H
+
+#include <linux/export.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/stddef.h>
+#include <linux/types.h>
+#include <linux/uidgid.h>
+
+struct binder_context {
+	struct binder_node *binder_context_mgr_node;
+	struct mutex context_mgr_node_lock;
+	kuid_t binder_context_mgr_uid;
+	const char *name;
+};
+
+/**
+ * struct binder_device - information about a binder device node
+ * @hlist:          list of binder devices (only used for devices requested via
+ *                  CONFIG_ANDROID_BINDER_DEVICES)
+ * @miscdev:        information about a binder character device node
+ * @context:        binder context information
+ * @binderfs_inode: This is the inode of the root dentry of the super block
+ *                  belonging to a binderfs mount.
+ * @idr:            The idr used to construct the name of the binder device in
+ *                  its binderfs mount.
+ */
+struct binder_device {
+	struct hlist_node hlist;
+	struct miscdevice miscdev;
+	struct binder_context context;
+	struct inode *binderfs_inode;
+	int idr;
+};
+
+extern const struct file_operations binder_fops;
+
+#ifdef CONFIG_ANDROID_BINDERFS
+extern bool is_binderfs_device(const struct inode *inode);
+#else
+static inline bool is_binderfs_device(const struct inode *inode)
+{
+	return false;
+}
+#endif
+
+#endif /* _LINUX_BINDER_INTERNAL_H */
diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
new file mode 100644
index 000000000000..7f88e811adce
--- /dev/null
+++ b/drivers/android/binderfs.c
@@ -0,0 +1,576 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/compiler_types.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/gfp.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/ipc_namespace.h>
+#include <linux/kdev_t.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/magic.h>
+#include <linux/major.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/mount.h>
+#include <linux/parser.h>
+#include <linux/radix-tree.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock_types.h>
+#include <linux/stddef.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/user_namespace.h>
+#include <linux/xarray.h>
+#include <uapi/asm-generic/errno-base.h>
+#include <uapi/linux/android/binder.h>
+#include <uapi/linux/android/binder_ctl.h>
+
+#include "binder_internal.h"
+
+#define FIRST_INODE 1
+#define SECOND_INODE 2
+#define INODE_OFFSET 3
+#define INTSTRLEN 21
+#define BINDERFS_MAX_MINOR (1U << MINORBITS)
+
+static struct vfsmount *binderfs_mnt;
+
+static dev_t binderfs_dev;
+static DEFINE_MUTEX(binderfs_minors_mutex);
+static DEFINE_IDR(binderfs_minors);
+
+/**
+ * binderfs_info - information about a binderfs mount
+ * @ipc_ns:         The ipc namespace the binderfs mount belongs to.
+ * @control_dentry: This records the dentry of this binderfs mount
+ *                  binder-control device.
+ * @idr:            Currently occupied idrs in this binderfs mount.
+ * @ctl_mutex:      Protect idr and devices from concurrent access
+ * @root_uid:       uid that needs to be used when a new binder device is
+ *                  created.
+ * @root_gid:       gid that needs to be used when a new binder device is
+ *                  created.
+ */
+struct binderfs_info {
+	struct ipc_namespace *ipc_ns;
+	struct dentry *control_dentry;
+	struct idr idr;
+	struct mutex ctl_mutex;
+	kuid_t root_uid;
+	kgid_t root_gid;
+
+};
+
+static inline struct binderfs_info *BINDERFS_I(struct inode *inode)
+{
+	return inode->i_sb->s_fs_info;
+}
+
+bool is_binderfs_device(const struct inode *inode)
+{
+	if (inode->i_sb->s_magic == BINDERFS_SUPER_MAGIC)
+		return true;
+
+	return false;
+}
+
+/**
+ * binderfs_new_inode - allocate inode from super block of a binderfs mount
+ * @ref_inode: inode from wich the super block will be taken
+ * @userp:     buffer to copy information about new device for userspace to
+ * @device:    binder device for which the new inode will be allocated
+ *
+ * This function will allocate a new inode from the super block of the
+ * filesystem mount and attach a dentry to that inode.
+ * Minor numbers are limited and tracked globally in binderfs_minors.
+ * The function will stash a struct binder_device for the specific binder
+ * device in i_private of the inode.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static int binderfs_new_inode(struct inode *ref_inode,
+			      struct binderfs_device __user *userp,
+			      struct binder_device *device)
+{
+	int minor, ret;
+	struct dentry *root, *dentry;
+	size_t name_len = sizeof("binder") - 1 + INTSTRLEN;
+	int new_idr = -1;
+	char *name = NULL;
+	struct inode *inode = NULL;
+	struct binderfs_device bfs_device = { 0 };
+	struct super_block *sb = ref_inode->i_sb;
+	struct binderfs_info *info = sb->s_fs_info;
+
+	/* Reserve new minor number for the new device. */
+	mutex_lock(&binderfs_minors_mutex);
+	minor = idr_alloc(&binderfs_minors, device, 0, BINDERFS_MAX_MINOR + 1,
+			  GFP_KERNEL);
+	mutex_unlock(&binderfs_minors_mutex);
+	if (minor < 0)
+		return minor;
+
+	ret = -ENOMEM;
+	inode = new_inode(sb);
+	if (!inode)
+		goto err;
+
+	inode->i_ino = minor + INODE_OFFSET;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	init_special_inode(inode, S_IFCHR | 0600,
+			   MKDEV(MAJOR(binderfs_dev), minor));
+	inode->i_fop = &binder_fops;
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+	inode->i_private = device;
+
+	name = kmalloc(name_len, GFP_KERNEL);
+	if (!name)
+		goto err;
+
+	/* Allocate new index for binder device name. */
+	ret = idr_alloc(&info->idr, device, 0, BINDERFS_MAX_MINOR + 1,
+			GFP_KERNEL);
+	if (ret < 0)
+		goto err;
+
+	new_idr = ret;
+	ret = snprintf(name, name_len, "binder%d", new_idr);
+	if (ret < 0 || (size_t)ret >= name_len) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	device->binderfs_inode = inode;
+	device->context.binder_context_mgr_uid = INVALID_UID;
+	device->context.name = name;
+	device->idr = new_idr;
+	device->miscdev.name = name;
+	device->miscdev.minor = minor;
+	mutex_init(&device->context.context_mgr_node_lock);
+
+	bfs_device.major = MAJOR(binderfs_dev);
+	bfs_device.minor = minor;
+	bfs_device.suffix = new_idr;
+
+	ret = copy_to_user(userp, &bfs_device, sizeof(bfs_device));
+	if (ret)
+		goto err;
+
+	ret = -ENOMEM;
+	root = sb->s_root;
+	inode_lock(d_inode(root));
+	dentry = d_alloc_name(root, name);
+	if (!dentry) {
+		inode_unlock(d_inode(root));
+		goto err;
+	}
+
+	d_add(dentry, inode);
+	fsnotify_create(root->d_inode, dentry);
+	inode_unlock(d_inode(root));
+
+	return 0;
+
+err:
+	kfree(name);
+	mutex_lock(&binderfs_minors_mutex);
+	idr_remove(&binderfs_minors, minor);
+	mutex_unlock(&binderfs_minors_mutex);
+	if (new_idr >= 0)
+		idr_remove(&info->idr, new_idr);
+	iput(inode);
+
+	return ret;
+}
+
+static int binderfs_binder_device_create(struct inode *inode,
+					 struct binderfs_device __user *userp)
+{
+	struct binder_device *device;
+	int ret;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return -ENOMEM;
+
+	ret = binderfs_new_inode(inode, userp, device);
+	if (ret < 0) {
+		kfree(device);
+		return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * binderfs_ctl_ioctl - handle binder device node allocation requests
+ *
+ * The request handler for the binder-control device. All requests operate on
+ * the binderfs mount the binder-control device resides in:
+ * - BINDER_CTL_ADD
+ *   Allocate a new binder device.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static long binder_ctl_ioctl(struct file *file, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct binderfs_info *info;
+	int ret = -EINVAL;
+	struct inode *inode = file_inode(file);
+	struct binderfs_device *device = (struct binderfs_device __user *)arg;
+
+	info = BINDERFS_I(inode);
+	mutex_lock(&info->ctl_mutex);
+	switch (cmd) {
+	case BINDER_CTL_ADD:
+		ret = binderfs_binder_device_create(inode, device);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&info->ctl_mutex);
+
+	return ret;
+}
+
+static void binderfs_evict_inode(struct inode *inode)
+{
+	struct binderfs_info *info = BINDERFS_I(inode);
+	struct binder_device *device = inode->i_private;
+
+	clear_inode(inode);
+
+	if (!device)
+		return;
+
+	mutex_lock(&binderfs_minors_mutex);
+	idr_remove(&binderfs_minors, device->miscdev.minor);
+	mutex_unlock(&binderfs_minors_mutex);
+
+	/* binder-control does not have an idr allocation */
+	if (device->idr >= 0) {
+		mutex_lock(&info->ctl_mutex);
+		idr_remove(&info->idr, device->idr);
+		mutex_unlock(&info->ctl_mutex);
+	}
+
+	kfree(device->context.name);
+	kfree(device);
+}
+
+static const struct super_operations binderfs_super_ops = {
+	.statfs = simple_statfs,
+	.evict_inode = binderfs_evict_inode,
+};
+
+static int binderfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			   struct inode *new_dir, struct dentry *new_dentry,
+			   unsigned int flags)
+{
+	struct inode *inode = d_inode(old_dentry);
+
+	/* binderfs doesn't support directories. */
+	if (d_is_dir(old_dentry))
+		return -EPERM;
+
+	if (flags & ~RENAME_NOREPLACE)
+		return -EINVAL;
+
+	if (!simple_empty(new_dentry))
+		return -ENOTEMPTY;
+
+	if (d_really_is_positive(new_dentry))
+		simple_unlink(new_dir, new_dentry);
+
+	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
+		new_dir->i_mtime = inode->i_ctime = current_time(old_dir);
+
+	return 0;
+}
+
+static int binderfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	/*
+	 * The control dentry is only ever touched during mount so checking it
+	 * here should not require us to take lock.
+	 */
+	if (BINDERFS_I(dir)->control_dentry == dentry)
+		return -EPERM;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations binderfs_dir_inode_operations = {
+	.lookup = simple_lookup,
+	.rename = binderfs_rename,
+	.unlink = binderfs_unlink,
+};
+
+static int binderfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct binderfs_info *info;
+	struct inode *inode = NULL;
+	struct ipc_namespace *ipc_ns = sb->s_fs_info;
+
+	get_ipc_ns(ipc_ns);
+
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+
+	/*
+	 * The binderfs filesystem can be mounted by userns root in a
+	 * non-initial userns. By default such mounts have the SB_I_NODEV flag
+	 * set in s_iflags to prevent security issues where userns root can
+	 * just create random device nodes via mknod() since it owns the
+	 * filesystem mount. But binderfs does not allow to create any files
+	 * including devices nodes. The only way to create binder devices nodes
+	 * is through the binder-control device which userns root is explicitly
+	 * allowed to do. So removing the SB_I_NODEV flag from s_iflags is both
+	 * necessary and safe.
+	 */
+	sb->s_iflags &= ~SB_I_NODEV;
+	sb->s_iflags |= SB_I_NOEXEC;
+	sb->s_magic = BINDERFS_SUPER_MAGIC;
+	sb->s_op = &binderfs_super_ops;
+	sb->s_time_gran = 1;
+
+	info = kzalloc(sizeof(struct binderfs_info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	info->ipc_ns = ipc_ns;
+	info->root_gid = make_kgid(sb->s_user_ns, 0);
+	if (!gid_valid(info->root_gid))
+		info->root_gid = GLOBAL_ROOT_GID;
+	info->root_uid = make_kuid(sb->s_user_ns, 0);
+	if (!uid_valid(info->root_uid))
+		info->root_uid = GLOBAL_ROOT_UID;
+	idr_init(&info->idr);
+	mutex_init(&info->ctl_mutex);
+
+	sb->s_fs_info = info;
+
+	inode = new_inode(sb);
+	if (!inode)
+		goto err;
+
+	inode->i_ino = FIRST_INODE;
+	inode->i_fop = &simple_dir_operations;
+	inode->i_mode = 0755;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_op = &binderfs_dir_inode_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (sb->s_root)
+		return 0;
+
+err:
+	if (inode)
+		iput(inode);
+	kfree(info);
+	put_ipc_ns(ipc_ns);
+
+	return -ENOMEM;
+}
+
+static const struct file_operations binder_ctl_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.unlocked_ioctl	= binder_ctl_ioctl,
+	.compat_ioctl	= binder_ctl_ioctl,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * binderfs_binder_ctl_create - create a new binder-control device
+ * @sb: super block of the binderfs mount
+ *
+ * This function creates a new binder-control device node in the binderfs mount
+ * referred to by @sb.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static int binderfs_binder_ctl_create(struct super_block *sb)
+{
+	int minor;
+	struct inode *inode;
+	struct dentry *dentry;
+	struct binder_device *device;
+	struct dentry *root = sb->s_root;
+	struct binderfs_info *info = sb->s_fs_info;
+	int ret = 0;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return -ENOMEM;
+
+	inode_lock(d_inode(root));
+
+	if (info->control_dentry)
+		goto out;
+
+	ret = -ENOMEM;
+	inode = new_inode(sb);
+	if (!inode)
+		goto out;
+
+	/* Reserve a new minor number for the new device. */
+	mutex_lock(&binderfs_minors_mutex);
+	minor = idr_alloc(&binderfs_minors, device, 0, BINDERFS_MAX_MINOR + 1,
+			  GFP_KERNEL);
+	mutex_unlock(&binderfs_minors_mutex);
+	if (minor < 0) {
+		ret = minor;
+		goto out;
+	}
+
+	inode->i_ino = SECOND_INODE;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	init_special_inode(inode, S_IFCHR | 0600,
+			   MKDEV(MAJOR(binderfs_dev), minor));
+	inode->i_fop = &binder_ctl_fops;
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+	inode->i_private = device;
+
+	device->binderfs_inode = inode;
+	device->miscdev.minor = minor;
+	device->idr = -1;
+
+	dentry = d_alloc_name(root, "binder-control");
+	if (!dentry)
+		goto out;
+
+	info->control_dentry = dentry;
+	d_add(dentry, inode);
+
+	ret = 0;
+out:
+	inode_unlock(d_inode(root));
+	if (ret)
+		kfree(device);
+	return ret;
+}
+
+
+static int binderfs_test_super(struct super_block *sb, void *data)
+{
+	struct binderfs_info *info = sb->s_fs_info;
+
+	if (info)
+		return info->ipc_ns == data;
+
+	return 0;
+}
+
+static int binderfs_set_super(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
+static void binderfs_free_binderfs_info(struct super_block *sb)
+{
+	struct binderfs_info *info = sb->s_fs_info;
+
+	put_ipc_ns(info->ipc_ns);
+	kfree(info);
+}
+
+static struct dentry *binderfs_mount(struct file_system_type *fs_type,
+				     int flags, const char *dev_name,
+				     void *data)
+{
+	int err;
+	bool new_sb;
+	struct super_block *sb;
+	struct ipc_namespace *ipc_ns = current->nsproxy->ipc_ns;
+
+	if (!ns_capable(ipc_ns->user_ns, CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	sb = sget_userns(fs_type, binderfs_test_super, binderfs_set_super,
+			 flags, ipc_ns->user_ns, ipc_ns);
+	if (IS_ERR(sb))
+		return ERR_CAST(sb);
+
+	new_sb = !sb->s_root;
+	if (new_sb) {
+		err = binderfs_fill_super(sb, data, flags & SB_SILENT ? 1 : 0);
+		if (err) {
+			deactivate_locked_super(sb);
+			return ERR_PTR(err);
+		}
+		dget(sb->s_root);
+		sb->s_flags |= SB_ACTIVE;
+	}
+
+	err = binderfs_binder_ctl_create(sb);
+	if (err) {
+		if (new_sb) {
+			binderfs_free_binderfs_info(sb);
+			dput(sb->s_root);
+			deactivate_locked_super(sb);
+		}
+		return ERR_PTR(err);
+	}
+
+	return sb->s_root;
+}
+
+static void binderfs_kill_super(struct super_block *sb)
+{
+	struct binderfs_info *info = sb->s_fs_info;
+	struct ipc_namespace *ipc_ns = info->ipc_ns;
+
+	kfree(info);
+	kill_litter_super(sb);
+	if (ipc_ns)
+		put_ipc_ns(ipc_ns);
+}
+
+static struct file_system_type binder_fs_type = {
+	.name		= "binder",
+	.mount		= binderfs_mount,
+	.kill_sb	= binderfs_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
+static int __init init_binderfs(void)
+{
+	int ret;
+
+	/* Allocate new major number for binderfs. */
+	ret = alloc_chrdev_region(&binderfs_dev, 0, BINDERFS_MAX_MINOR,
+				  "binder");
+	if (ret < 0)
+		return ret;
+
+	ret = register_filesystem(&binder_fs_type);
+	if (ret) {
+		unregister_chrdev_region(binderfs_dev, BINDERFS_MAX_MINOR);
+		return ret;
+	}
+
+	binderfs_mnt = kern_mount(&binder_fs_type);
+	if (IS_ERR(binderfs_mnt)) {
+		ret = PTR_ERR(binderfs_mnt);
+		binderfs_mnt = NULL;
+		unregister_filesystem(&binder_fs_type);
+		unregister_chrdev_region(binderfs_dev, BINDERFS_MAX_MINOR);
+	}
+
+	return ret;
+}
+
+device_initcall(init_binderfs);
diff --git a/include/uapi/linux/android/binder_ctl.h b/include/uapi/linux/android/binder_ctl.h
new file mode 100644
index 000000000000..a706608dd67c
--- /dev/null
+++ b/include/uapi/linux/android/binder_ctl.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2018 Canonical Ltd.
+ *
+ */
+
+#ifndef _UAPI_LINUX_BINDER_CTL_H
+#define _UAPI_LINUX_BINDER_CTL_H
+
+#include <linux/android/binder.h>
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/**
+ * struct binderfs_device - retrieve information about a new binder device
+ * @major:  major number allocated for binderfs binder devices
+ * @minor:  minor number allocated for the new binderfs binder device
+ * @suffix: index used as the name for the new binder device
+ *
+ */
+struct binderfs_device {
+	__u32 major;
+	__u32 minor;
+	__u32 suffix;
+};
+
+/**
+ * Allocate a new binder device.
+ */
+#define BINDER_CTL_ADD _IOWR('b', 1, struct binderfs_device)
+
+#endif /* _UAPI_LINUX_BINDER_CTL_H */
+
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 96c24478d8ce..f8c00045d537 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -73,6 +73,7 @@
 #define DAXFS_MAGIC             0x64646178
 #define BINFMTFS_MAGIC          0x42494e4d
 #define DEVPTS_SUPER_MAGIC	0x1cd1
+#define BINDERFS_SUPER_MAGIC	0x6c6f6f70
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define PIPEFS_MAGIC            0x50495045
 #define PROC_SUPER_MAGIC	0x9fa0
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] binder: implement binderfs
  2018-12-04 13:12 [PATCH] binder: implement binderfs Christian Brauner
@ 2018-12-05 20:01 ` Greg KH
  2018-12-05 21:42   ` Christian Brauner
  0 siblings, 1 reply; 5+ messages in thread
From: Greg KH @ 2018-12-05 20:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: tkjos, maco, linux-kernel, devel, kilobyte, darrick.wong,
	chouryzhou, david, arve, joel, Todd Kjos

On Tue, Dec 04, 2018 at 02:12:39PM +0100, Christian Brauner wrote:
> As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
> implementation of binderfs. If you want to skip reading and just see how it
> works, please go to [2].

First off, thanks for doing this so quickly.  I think the overall idea
and implementation is great, I just have some minor issues with the user
api:

> /* binder-control */
> Each new binderfs instance comes with a binder-control device. No other
> devices will be present at first. The binder-control device can be used to
> dynamically allocate binder devices. All requests operate on the binderfs
> mount the binder-control device resides in:
> - BINDER_CTL_ADD
>   Allocate a new binder device.
> Assuming a new instance of binderfs has been mounted at /dev/binderfs via
> mount -t binderfs binderfs /dev/binderfs. Then a request to create a new
> binder device can be made via:
> 
> struct binderfs_device device = {0};
> int fd = open("/dev/binderfs/binder-control", O_RDWR);
> ioctl(fd, BINDER_CTL_ADD, &device);
> 
> The struct binderfs_device will be used to return the major and minor
> number, as well as the index used as the new name for the device.
> Binderfs devices can simply be removed via unlink().

I think you should provide a name in the BINDER_CTL_ADD command.  That
way you can easily emulate the existing binder queues, and it saves you
a create/rename sequence that you will be forced to do otherwise.  Why
not do it just in a single command?

That way also you don't need to care about the major/minor number at
all.  Userspace should never need to worry about that, use a name,
that's the best thing.  Also, it allows you to drop the use of the idr,
making the kernel code simpler overall.

> /* Implementation details */
> - When binderfs is registered as a new filesystem it will dynamically
>   allocate a new major number. The allocated major number will be returned
>   in struct binderfs_device when a new binder device is allocated.

Why does userspace care about major/minor numbers at all?  You should
just be able to deal with the binder "names", that's all that userspace
uses normally as you are not calling mknod() yourself.

>   Minor numbers that have been given out are tracked in a global idr struct
>   that is capped at BINDERFS_MAX_MINOR. The minor number tracker is
>   protected by a global mutex. This is the only point of contention between
>   binderfs mounts.

I doubt this will be any real contention given that setting up / tearing
down binder mounts is going to be rare, right?  Well, hopefully, who
knows with some container systems...

> - The naming scheme for binder devices is binder%d. Each binderfs mount
>   starts numbering of new binder devices at 0 up to n. The indeces used in
>   constructing the name are tracked in a struct idr that is per-binderfs
>   super block.

Again, let userspace pick the name, as you will have to rename it anyway
to get userspace to work properly with it.

I'll stop repeating myself now :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] binder: implement binderfs
  2018-12-05 20:01 ` Greg KH
@ 2018-12-05 21:42   ` Christian Brauner
  2018-12-06 14:04     ` Greg KH
  0 siblings, 1 reply; 5+ messages in thread
From: Christian Brauner @ 2018-12-05 21:42 UTC (permalink / raw)
  To: Greg KH
  Cc: tkjos, maco, linux-kernel, devel, kilobyte, darrick.wong,
	chouryzhou, david, arve, joel, Todd Kjos

On Wed, Dec 05, 2018 at 09:01:45PM +0100, Greg KH wrote:
> On Tue, Dec 04, 2018 at 02:12:39PM +0100, Christian Brauner wrote:
> > As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
> > implementation of binderfs. If you want to skip reading and just see how it
> > works, please go to [2].
> 
> First off, thanks for doing this so quickly.  I think the overall idea
> and implementation is great, I just have some minor issues with the user
> api:

Thanks! :)

> 
> > /* binder-control */
> > Each new binderfs instance comes with a binder-control device. No other
> > devices will be present at first. The binder-control device can be used to
> > dynamically allocate binder devices. All requests operate on the binderfs
> > mount the binder-control device resides in:
> > - BINDER_CTL_ADD
> >   Allocate a new binder device.
> > Assuming a new instance of binderfs has been mounted at /dev/binderfs via
> > mount -t binderfs binderfs /dev/binderfs. Then a request to create a new
> > binder device can be made via:
> > 
> > struct binderfs_device device = {0};
> > int fd = open("/dev/binderfs/binder-control", O_RDWR);
> > ioctl(fd, BINDER_CTL_ADD, &device);
> > 
> > The struct binderfs_device will be used to return the major and minor
> > number, as well as the index used as the new name for the device.
> > Binderfs devices can simply be removed via unlink().
> 
> I think you should provide a name in the BINDER_CTL_ADD command.  That
> way you can easily emulate the existing binder queues, and it saves you
> a create/rename sequence that you will be forced to do otherwise.  Why
> not do it just in a single command?

Sounds reasonable. How do you feel about capping the name length at 255
bytes aka the standard Linux file name length (e.g. xfs, ext4 etc.)?

#define BINDERFS_NAME_MAX 255

struct binderfs_device {
        char name[BINDERFS_NAME_MAX + 1];
	__u32 major;
	__u32 minor;
}

> 
> That way also you don't need to care about the major/minor number at
> all.  Userspace should never need to worry about that, use a name,
> that's the best thing.  Also, it allows you to drop the use of the idr,
> making the kernel code simpler overall.
> 
> > /* Implementation details */
> > - When binderfs is registered as a new filesystem it will dynamically
> >   allocate a new major number. The allocated major number will be returned
> >   in struct binderfs_device when a new binder device is allocated.
> 
> Why does userspace care about major/minor numbers at all?  You should

Userspace cares for the sake of the devices cgroup which operates on
device numnbers to restrict access to devices. Since binderfs doesn't
have a static major number returning that information is helpful.
One example is a managining process sharing a binderfs mounts between
multiple ipc+mount namespaces via bind-mounts to avoid having separate
binderfs mounts. If the managing process only wants to grant access to binder0
and no other device for a given ipc+mnt namespace combination then it
can use the devices cgroup but for that it needs to know the major and
minor number.

> just be able to deal with the binder "names", that's all that userspace
> uses normally as you are not calling mknod() yourself.
> 
> >   Minor numbers that have been given out are tracked in a global idr struct
> >   that is capped at BINDERFS_MAX_MINOR. The minor number tracker is
> >   protected by a global mutex. This is the only point of contention between
> >   binderfs mounts.
> 
> I doubt this will be any real contention given that setting up / tearing
> down binder mounts is going to be rare, right?  Well, hopefully, who
> knows with some container systems...

Yeah, it's very unlikely. Device allocation is a rare event that is
basically done once for most interesting use-cases. If one cares so much
about performance bind-mounts + device cgroup restrictions are possible
solution to this problem.

> 
> > - The naming scheme for binder devices is binder%d. Each binderfs mount
> >   starts numbering of new binder devices at 0 up to n. The indeces used in
> >   constructing the name are tracked in a struct idr that is per-binderfs
> >   super block.
> 
> Again, let userspace pick the name, as you will have to rename it anyway
> to get userspace to work properly with it.
> 
> I'll stop repeating myself now :)

Thanks for the feedback! :)

Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] binder: implement binderfs
  2018-12-05 21:42   ` Christian Brauner
@ 2018-12-06 14:04     ` Greg KH
  2018-12-06 17:45       ` Christian Brauner
  0 siblings, 1 reply; 5+ messages in thread
From: Greg KH @ 2018-12-06 14:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: tkjos, maco, linux-kernel, devel, kilobyte, darrick.wong,
	chouryzhou, david, arve, joel, Todd Kjos

On Wed, Dec 05, 2018 at 10:42:06PM +0100, Christian Brauner wrote:
> On Wed, Dec 05, 2018 at 09:01:45PM +0100, Greg KH wrote:
> > > /* binder-control */
> > > Each new binderfs instance comes with a binder-control device. No other
> > > devices will be present at first. The binder-control device can be used to
> > > dynamically allocate binder devices. All requests operate on the binderfs
> > > mount the binder-control device resides in:
> > > - BINDER_CTL_ADD
> > >   Allocate a new binder device.
> > > Assuming a new instance of binderfs has been mounted at /dev/binderfs via
> > > mount -t binderfs binderfs /dev/binderfs. Then a request to create a new
> > > binder device can be made via:
> > > 
> > > struct binderfs_device device = {0};
> > > int fd = open("/dev/binderfs/binder-control", O_RDWR);
> > > ioctl(fd, BINDER_CTL_ADD, &device);
> > > 
> > > The struct binderfs_device will be used to return the major and minor
> > > number, as well as the index used as the new name for the device.
> > > Binderfs devices can simply be removed via unlink().
> > 
> > I think you should provide a name in the BINDER_CTL_ADD command.  That
> > way you can easily emulate the existing binder queues, and it saves you
> > a create/rename sequence that you will be forced to do otherwise.  Why
> > not do it just in a single command?
> 
> Sounds reasonable. How do you feel about capping the name length at 255
> bytes aka the standard Linux file name length (e.g. xfs, ext4 etc.)?
> 
> #define BINDERFS_NAME_MAX 255
> 
> struct binderfs_device {
>         char name[BINDERFS_NAME_MAX + 1];

__u8 is the proper type to cross the user/kernel boundry :)

> 	__u32 major;
> 	__u32 minor;
> }

Yes, limiting it to 255 is fine with me.

> > That way also you don't need to care about the major/minor number at
> > all.  Userspace should never need to worry about that, use a name,
> > that's the best thing.  Also, it allows you to drop the use of the idr,
> > making the kernel code simpler overall.
> > 
> > > /* Implementation details */
> > > - When binderfs is registered as a new filesystem it will dynamically
> > >   allocate a new major number. The allocated major number will be returned
> > >   in struct binderfs_device when a new binder device is allocated.
> > 
> > Why does userspace care about major/minor numbers at all?  You should
> 
> Userspace cares for the sake of the devices cgroup which operates on
> device numnbers to restrict access to devices. Since binderfs doesn't
> have a static major number returning that information is helpful.

Ugh, ok, that makes sense.  If we really want to make the kernel
interface simpler, drop the major/minor and then have userspace do the
stat(2) to see what the major/minor number they care about is.

But yeah, keeping it here makes everyone's life simpler, the kernel
already knows this, and it's trivial to pass it back to userspace this
way.

Care to make this change and resend?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] binder: implement binderfs
  2018-12-06 14:04     ` Greg KH
@ 2018-12-06 17:45       ` Christian Brauner
  0 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2018-12-06 17:45 UTC (permalink / raw)
  To: Greg KH
  Cc: tkjos, maco, linux-kernel, devel, kilobyte, darrick.wong,
	chouryzhou, david, arve, joel, Todd Kjos

On Thu, Dec 06, 2018 at 03:04:03PM +0100, Greg KH wrote:
> On Wed, Dec 05, 2018 at 10:42:06PM +0100, Christian Brauner wrote:
> > On Wed, Dec 05, 2018 at 09:01:45PM +0100, Greg KH wrote:
> > > > /* binder-control */
> > > > Each new binderfs instance comes with a binder-control device. No other
> > > > devices will be present at first. The binder-control device can be used to
> > > > dynamically allocate binder devices. All requests operate on the binderfs
> > > > mount the binder-control device resides in:
> > > > - BINDER_CTL_ADD
> > > >   Allocate a new binder device.
> > > > Assuming a new instance of binderfs has been mounted at /dev/binderfs via
> > > > mount -t binderfs binderfs /dev/binderfs. Then a request to create a new
> > > > binder device can be made via:
> > > > 
> > > > struct binderfs_device device = {0};
> > > > int fd = open("/dev/binderfs/binder-control", O_RDWR);
> > > > ioctl(fd, BINDER_CTL_ADD, &device);
> > > > 
> > > > The struct binderfs_device will be used to return the major and minor
> > > > number, as well as the index used as the new name for the device.
> > > > Binderfs devices can simply be removed via unlink().
> > > 
> > > I think you should provide a name in the BINDER_CTL_ADD command.  That
> > > way you can easily emulate the existing binder queues, and it saves you
> > > a create/rename sequence that you will be forced to do otherwise.  Why
> > > not do it just in a single command?
> > 
> > Sounds reasonable. How do you feel about capping the name length at 255
> > bytes aka the standard Linux file name length (e.g. xfs, ext4 etc.)?
> > 
> > #define BINDERFS_NAME_MAX 255
> > 
> > struct binderfs_device {
> >         char name[BINDERFS_NAME_MAX + 1];
> 
> __u8 is the proper type to cross the user/kernel boundry :)

Will switch. :)

> 
> > 	__u32 major;
> > 	__u32 minor;
> > }
> 
> Yes, limiting it to 255 is fine with me.

Perfect!

> 
> > > That way also you don't need to care about the major/minor number at
> > > all.  Userspace should never need to worry about that, use a name,
> > > that's the best thing.  Also, it allows you to drop the use of the idr,
> > > making the kernel code simpler overall.
> > > 
> > > > /* Implementation details */
> > > > - When binderfs is registered as a new filesystem it will dynamically
> > > >   allocate a new major number. The allocated major number will be returned
> > > >   in struct binderfs_device when a new binder device is allocated.
> > > 
> > > Why does userspace care about major/minor numbers at all?  You should
> > 
> > Userspace cares for the sake of the devices cgroup which operates on
> > device numnbers to restrict access to devices. Since binderfs doesn't
> > have a static major number returning that information is helpful.
> 
> Ugh, ok, that makes sense.  If we really want to make the kernel
> interface simpler, drop the major/minor and then have userspace do the
> stat(2) to see what the major/minor number they care about is.
> 
> But yeah, keeping it here makes everyone's life simpler, the kernel
> already knows this, and it's trivial to pass it back to userspace this
> way.
> 
> Care to make this change and resend?

For sure. I have a long-haul flight for ~15h so by the time I land I
have a new version I can send out. :)

Thanks!
Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-12-06 17:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-04 13:12 [PATCH] binder: implement binderfs Christian Brauner
2018-12-05 20:01 ` Greg KH
2018-12-05 21:42   ` Christian Brauner
2018-12-06 14:04     ` Greg KH
2018-12-06 17:45       ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).