LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/8] loopfs
@ 2020-04-08 15:21 Christian Brauner
  2020-04-08 15:21 ` [PATCH 1/8] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Hey everyone,

After having been pinged about this by various people recently here's loopfs.

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner and more complete
approach.

To experiment, the patchset can be found in the following locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
https://gitlab.com/brauner/linux/-/commits/loopfs
https://github.com/brauner/linux/tree/loopfs

One of the use-cases for loopfs is to allow to dynamically allocate loop
devices in sandboxed workloads without exposing /dev or
/dev/loop-control to the workload in question and without having to
implement a complex and also racy protocol to send around file
descriptors for loop devices. With loopfs each mount is a new instance,
i.e. loop devices created in one loopfs instance are independent of any
loop devices created in another loopfs instance. This allows
sufficiently privileged tools to have their own private stash of loop
device instances. Dmitry has expressed his desire to use this for
syzkaller in a private discussion. And various parties that want to use
it are Cced here too.

In addition, the loopfs filesystem can be mounted by user namespace root
and is thus suitable for use in containers. Combined with syscall
interception this makes it possible to securely delegate mounting of
images on loop devices, i.e. when a user calls mount -o loop <image>
<mountpoint> it will be possible to completely setup the loop device.
The final mount syscall to actually perform the mount will be handled
through syscall interception and be performed by a sufficiently
privileged process. Syscall interception is already supported through a
new seccomp feature we implemented in [1] and extended in [2] and is
actively used in production workloads. The additional loopfs work will
be used there and in various other workloads too. You'll find a short
illustration how this works with syscall interception below in [4].

The number of loop devices available to a loopfs instance can be limited
by setting the "max" mount option to a positive integer. This e.g.
allows sufficiently privileged processes to dynamically enforce a limit
on the number of devices. This limit is dynamic in contrast to the
max_loop module option in that a sufficiently privileged process can
update it with a simple remount operation.

The loopfs filesystem is placed under a new config option and special
care has been taken to not introduce any new code when users do not
select this config option.

Thanks!
Christian

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]:
     root@f1:~# cat /proc/self/uid_map
              0     100000 1000000000
     root@f1:~# cat /proc/self/gid_map
              0     100000 1000000000
     root@f1:~# mkdir /dev/loopfs
     root@f1:~# mount -t loop loop /dev/loopfs/
     root@f1:~# ln -sf /dev/loopfs/loop-control /dev/loop-control
     root@f1:~# losetup -f
     /dev/loop9
     root@f1:~# ln -sf /dev/loopfs/loop9 /dev/loop9
     root@f1:~# ls -al /sys/class/block/loop9
     lrwxrwxrwx 1 root root 0 Apr  8 14:53 /sys/class/block/loop9 -> ../../devices/virtual/block/loop9
     root@f1:~# ls -al /sys/class/block/loop9/
     total 0
     drwxr-xr-x  9 root   root       0 Apr  8 14:53 .
     drwxr-xr-x 13 nobody nogroup    0 Apr  8 14:53 ..
     -r--r--r--  1 root   root    4096 Apr  8 14:53 alignment_offset
     lrwxrwxrwx  1 nobody nogroup    0 Apr  8 14:53 bdi -> ../../bdi/7:9
     -r--r--r--  1 root   root    4096 Apr  8 14:53 capability
     -r--r--r--  1 root   root    4096 Apr  8 14:53 dev
     -r--r--r--  1 root   root    4096 Apr  8 14:53 discard_alignment
     -r--r--r--  1 root   root    4096 Apr  8 14:53 events
     -r--r--r--  1 root   root    4096 Apr  8 14:53 events_async
     -rw-r--r--  1 root   root    4096 Apr  8 14:53 events_poll_msecs
     -r--r--r--  1 root   root    4096 Apr  8 14:53 ext_range
     -r--r--r--  1 root   root    4096 Apr  8 14:53 hidden
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 holders
     -r--r--r--  1 root   root    4096 Apr  8 14:53 inflight
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 integrity
     drwxr-xr-x  3 nobody nogroup    0 Apr  8 14:53 mq
     drwxr-xr-x  2 root   root       0 Apr  8 14:53 power
     drwxr-xr-x  3 nobody nogroup    0 Apr  8 14:53 queue
     -r--r--r--  1 root   root    4096 Apr  8 14:53 range
     -r--r--r--  1 root   root    4096 Apr  8 14:53 removable
     -r--r--r--  1 root   root    4096 Apr  8 14:53 ro
     -r--r--r--  1 root   root    4096 Apr  8 14:53 size
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 slaves
     -r--r--r--  1 root   root    4096 Apr  8 14:53 stat
     lrwxrwxrwx  1 nobody nogroup    0 Apr  8 14:53 subsystem -> ../../../../class/block
     drwxr-xr-x  2 root   root       0 Apr  8 14:53 trace
     -rw-r--r--  1 root   root    4096 Apr  8 14:53 uevent
     root@f1:~#  
     root@f1:~# stat --file-system /bla.img
       File: "/bla.img"
         ID: 4396dc4f5f3ffe1b Namelen: 255     Type: btrfs
     Block size: 4096       Fundamental block size: 4096
     Blocks: Total: 11230468   Free: 10851929   Available: 10738585
     Inodes: Total: 0          Free: 0
     root@f1:~# mount -o loop /bla.img /opt
     root@f1:~# findmnt | grep opt
     └─/opt                                /dev/loop9                            btrfs       rw,relatime,ssd,space_cache,subvolid=5,subvol=/

Christian Brauner (8):
  kobject_uevent: remove unneeded netlink_ns check
  loopfs: implement loopfs
  loop: use ns_capable for some loop operations
  kernfs: handle multiple namespace tags
  kernfs: let objects opt-in to propagating from the initial namespace
  genhd: add minimal namespace infrastructure
  loopfs: start attaching correct namespace during loop_add()
  loopfs: only show devices in their correct instance

 Documentation/filesystems/sysfs-tagging.txt |   1 -
 MAINTAINERS                                 |   5 +
 block/genhd.c                               |  79 ++++
 drivers/base/devtmpfs.c                     |   4 +-
 drivers/block/Kconfig                       |   4 +
 drivers/block/Makefile                      |   1 +
 drivers/block/loop.c                        | 186 +++++++--
 drivers/block/loop.h                        |   8 +-
 drivers/block/loopfs/Makefile               |   3 +
 drivers/block/loopfs/loopfs.c               | 429 ++++++++++++++++++++
 drivers/block/loopfs/loopfs.h               |  35 ++
 fs/kernfs/dir.c                             |  38 +-
 fs/kernfs/kernfs-internal.h                 |  26 +-
 fs/kernfs/mount.c                           |  11 +-
 fs/sysfs/mount.c                            |  14 +-
 include/linux/device.h                      |   3 +
 include/linux/genhd.h                       |   3 +
 include/linux/kernfs.h                      |  44 +-
 include/linux/kobject_ns.h                  |   7 +-
 include/linux/sysfs.h                       |   8 +-
 include/uapi/linux/magic.h                  |   1 +
 lib/kobject.c                               |  17 +-
 lib/kobject_uevent.c                        |   2 +-
 net/core/net-sysfs.c                        |   6 -
 24 files changed, 834 insertions(+), 101 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h


base-commit: 7111951b8d4973bda27ff663f2cf18b663d15b48
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/8] kobject_uevent: remove unneeded netlink_ns check
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-08 15:21 ` [PATCH 2/8] loopfs: implement loopfs Christian Brauner
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Back when I rewrote large chunks of uevent sending I should have removed
the .netlink_ns method completely after having removed it's last user in
[1]. Let's remove it now and also remove the helper associated with it
that is unused too.

Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 Documentation/filesystems/sysfs-tagging.txt |  1 -
 include/linux/kobject_ns.h                  |  3 ---
 lib/kobject.c                               | 13 -------------
 lib/kobject_uevent.c                        |  2 +-
 net/core/net-sysfs.c                        |  6 ------
 5 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
index c7c8e6438958..51d28dd8b84f 100644
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ b/Documentation/filesystems/sysfs-tagging.txt
@@ -37,6 +37,5 @@ Users of this interface:
 - define a type in the kobj_ns_type enumeration.
 - call kobj_ns_type_register() with its kobj_ns_type_operations which has
   - current_ns() which returns current's namespace
-  - netlink_ns() which returns a socket's namespace
   - initial_ns() which returns the initial namesapce
 - call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 069aa2ebef90..991a9286bcea 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -32,7 +32,6 @@ enum kobj_ns_type {
 /*
  * Callbacks so sysfs can determine namespaces
  *   @grab_current_ns: return a new reference to calling task's namespace
- *   @netlink_ns: return namespace to which a sock belongs (right?)
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
  */
@@ -40,7 +39,6 @@ struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
 	bool (*current_may_mount)(void);
 	void *(*grab_current_ns)(void);
-	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
 };
@@ -52,7 +50,6 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
 bool kobj_ns_current_may_mount(enum kobj_ns_type type);
 void *kobj_ns_grab_current(enum kobj_ns_type type);
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
 void kobj_ns_drop(enum kobj_ns_type type, void *ns);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 83198cb37d8d..6f07083cc111 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -1092,19 +1092,6 @@ void *kobj_ns_grab_current(enum kobj_ns_type type)
 }
 EXPORT_SYMBOL_GPL(kobj_ns_grab_current);
 
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk)
-{
-	const void *ns = NULL;
-
-	spin_lock(&kobj_ns_type_lock);
-	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
-	    kobj_ns_ops_tbl[type])
-		ns = kobj_ns_ops_tbl[type]->netlink_ns(sk);
-	spin_unlock(&kobj_ns_type_lock);
-
-	return ns;
-}
-
 const void *kobj_ns_initial(enum kobj_ns_type type)
 {
 	const void *ns = NULL;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 7998affa45d4..a45b3eeaa2b9 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -400,7 +400,7 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 	 * are the only tag relevant here since we want to decide which
 	 * network namespaces to broadcast the uevent into.
 	 */
-	if (ops && ops->netlink_ns && kobj->ktype->namespace)
+	if (ops && kobj->ktype->namespace)
 		if (ops->type == KOBJ_NS_TYPE_NET)
 			net = kobj->ktype->namespace(kobj);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4c826b8bf9b1..5d05e13a96cc 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1592,16 +1592,10 @@ static const void *net_initial_ns(void)
 	return &init_net;
 }
 
-static const void *net_netlink_ns(struct sock *sk)
-{
-	return sock_net(sk);
-}
-
 const struct kobj_ns_type_operations net_ns_type_operations = {
 	.type = KOBJ_NS_TYPE_NET,
 	.current_may_mount = net_current_may_mount,
 	.grab_current_ns = net_grab_current_ns,
-	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
 	.drop_ns = net_drop_ns,
 };
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2/8] loopfs: implement loopfs
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
  2020-04-08 15:21 ` [PATCH 1/8] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-09  5:39   ` David Rheinsberg
  2020-04-09  7:53   ` Christoph Hellwig
  2020-04-08 15:21 ` [PATCH 3/8] loop: use ns_capable for some loop operations Christian Brauner
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overally good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

One of the use-cases for loopfs is to allow to dynamically allocate loop
devices in sandboxed workloads without exposing /dev or
/dev/loop-control to the workload in question and without having to
implement a complex and also racy protocol to send around file
descriptors for loop devices. With loopfs each mount is a new instance,
i.e. loop devices created in one loopfs instance are independent of any
loop devices created in another loopfs instance. This allows
sufficiently privileged tools to have their own private stash of loop
device instances.

In addition, the loopfs filesystem can be mounted by user namespace root
and is thus suitable for use in containers. Combined with syscall
interception this makes it possible to securely delegate mounting of
images on loop devices, i.e. when a users calls mount -o loop <image>
<mountpoint> it will be possible to completely setup the loop device
(enabled in later patches) and the mount syscall to actually perform the
mount will be handled through syscall interception and be performed by a
sufficiently privileged process. Syscall interception is already
supported through a new seccomp feature we implemented in [1] and
extended in [2] and is actively used in production workloads. The
additional loopfs work will be used there and in various other workloads
too.

The number of loop devices available to a loopfs instance can be limited
by setting the "max" mount option to a positive integer. This e.g.
allows sufficiently privileged processes to dynamically enforce a limit
on the number of devices. This limit is dynamic in contrast to the
max_loop module option in that a sufficiently privileged process can
update it with a simple remount operation.

The loopfs filesystem is placed under a new config option and special
care has been taken to not introduce any new code when users do not
select this config option.

Note that in __loop_clr_fd() we now need not just check whether bdev is
valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shutdown while backing files might still be
associated with some loop devices. In such cases no bd_disk device will
be attached to bdev. This is not in itself noteworthy it's more about
documenting the "why" of the added bdev->bd_disk check for posterity.

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 MAINTAINERS                   |   5 +
 drivers/block/Kconfig         |   4 +
 drivers/block/Makefile        |   1 +
 drivers/block/loop.c          | 151 +++++++++---
 drivers/block/loop.h          |   8 +-
 drivers/block/loopfs/Makefile |   3 +
 drivers/block/loopfs/loopfs.c | 429 ++++++++++++++++++++++++++++++++++
 drivers/block/loopfs/loopfs.h |  35 +++
 include/uapi/linux/magic.h    |   1 +
 9 files changed, 600 insertions(+), 37 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5a5332b3591d..0a2886645f43 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9852,6 +9852,11 @@ S:	Supported
 F:	drivers/message/fusion/
 F:	drivers/scsi/mpt3sas/
 
+LOOPFS FILE SYSTEM
+M:	Christian Brauner <christian.brauner@ubuntu.com>
+S:	Supported
+F:	drivers/block/loopfs/
+
 LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
 M:	Matthew Wilcox <willy@infradead.org>
 L:	linux-scsi@vger.kernel.org
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 025b1b77b11a..d7ff37d795ad 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -214,6 +214,10 @@ config BLK_DEV_LOOP
 
 	  Most users will answer N here.
 
+config BLK_DEV_LOOPFS
+	bool "Loopback device virtual filesystem support"
+	depends on BLK_DEV_LOOP=y
+
 config BLK_DEV_LOOP_MIN_COUNT
 	int "Number of loop devices to pre-create at init time"
 	depends on BLK_DEV_LOOP
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index a53cc1e3a2d3..24fdd7c56eeb 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
 obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
 obj-$(CONFIG_BLK_DEV_RBD)     += rbd.o
 obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_ZRAM) += zram/
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..a7eda14be639 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,6 +81,10 @@
 
 #include "loop.h"
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+#include "loopfs/loopfs.h"
+#endif
+
 #include <linux/uaccess.h>
 
 static DEFINE_IDR(loop_index_idr);
@@ -1142,7 +1146,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);
-	if (bdev) {
+	if (bdev && bdev->bd_disk) {
 		bd_set_size(bdev, 0);
 		/* let user-space know about this change */
 		kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
@@ -1152,7 +1156,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
-	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk;
 	lo_number = lo->lo_number;
 	loop_unprepare_queue(lo);
 out_unlock:
@@ -1237,6 +1241,61 @@ static int loop_clr_fd(struct loop_device *lo)
 	return __loop_clr_fd(lo, false);
 }
 
+static void loop_remove(struct loop_device *lo)
+{
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	loopfs_loop_device_remove(lo);
+#endif
+	del_gendisk(lo->lo_disk);
+	blk_cleanup_queue(lo->lo_queue);
+	blk_mq_free_tag_set(&lo->tag_set);
+	put_disk(lo->lo_disk);
+	kfree(lo);
+}
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+int loopfs_rundown_locked(struct loop_device *lo)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(!loopfs_i_sb(lo->lo_loopfs_i)))
+		return -EINVAL;
+
+	ret = mutex_lock_killable(&loop_ctl_mutex);
+	if (ret)
+		return ret;
+
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) {
+		ret = -EBUSY;
+	} else {
+		/*
+		 * Since the device is unbound it has no associated backing
+		 * file and we can safely set Lo_rundown to prevent it from
+		 * being found. Actual cleanup happens during inode eviction.
+		 */
+		lo->lo_state = Lo_rundown;
+		ret = 0;
+	}
+
+	mutex_unlock(&loop_ctl_mutex);
+	return ret;
+}
+
+void loopfs_remove_locked(struct loop_device *lo)
+{
+	if (WARN_ON_ONCE(!loopfs_i_sb(lo->lo_loopfs_i)))
+		return;
+
+	loop_clr_fd(lo);
+
+	mutex_lock(&loop_ctl_mutex);
+	idr_remove(&loop_index_idr, lo->lo_number);
+	lo->lo_loopfs_i = NULL;
+	loop_remove(lo);
+	mutex_unlock(&loop_ctl_mutex);
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1856,6 +1915,11 @@ static const struct block_device_operations lo_fops = {
  * And now the modules code and kernel interface.
  */
 static int max_loop;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+unsigned long max_devices;
+#else
+static unsigned long max_devices;
+#endif
 module_param(max_loop, int, 0444);
 MODULE_PARM_DESC(max_loop, "Maximum number of loop devices");
 module_param(max_part, int, 0444);
@@ -1981,7 +2045,7 @@ static const struct blk_mq_ops loop_mq_ops = {
 	.complete	= lo_complete_rq,
 };
 
-static int loop_add(struct loop_device **l, int i)
+static int loop_add(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	struct gendisk *disk;
@@ -2071,7 +2135,17 @@ static int loop_add(struct loop_device **l, int i)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+
 	add_disk(disk);
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	err = loopfs_loop_device_create(lo, inode, disk_devt(disk));
+	if (err) {
+		loop_remove(lo);
+		goto out;
+	}
+#endif
+
 	*l = lo;
 	return lo->lo_number;
 
@@ -2087,36 +2161,41 @@ static int loop_add(struct loop_device **l, int i)
 	return err;
 }
 
-static void loop_remove(struct loop_device *lo)
-{
-	del_gendisk(lo->lo_disk);
-	blk_cleanup_queue(lo->lo_queue);
-	blk_mq_free_tag_set(&lo->tag_set);
-	put_disk(lo->lo_disk);
-	kfree(lo);
-}
+struct find_free_cb_data {
+	struct loop_device **l;
+	struct inode *inode;
+};
 
 static int find_free_cb(int id, void *ptr, void *data)
 {
 	struct loop_device *lo = ptr;
-	struct loop_device **l = data;
+	struct find_free_cb_data *cb_data = data;
 
-	if (lo->lo_state == Lo_unbound) {
-		*l = lo;
-		return 1;
-	}
-	return 0;
+	if (lo->lo_state != Lo_unbound)
+		return 0;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (!loopfs_same_instance(cb_data->inode, lo->lo_loopfs_i))
+		return 0;
+#endif
+
+	*cb_data->l = lo;
+	return 1;
 }
 
-static int loop_lookup(struct loop_device **l, int i)
+static int loop_lookup(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	int ret = -ENODEV;
 
 	if (i < 0) {
 		int err;
+		struct find_free_cb_data cb_data = {
+			.l = &lo,
+			.inode = inode,
+		};
 
-		err = idr_for_each(&loop_index_idr, &find_free_cb, &lo);
+		err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data);
 		if (err == 1) {
 			*l = lo;
 			ret = lo->lo_number;
@@ -2127,6 +2206,11 @@ static int loop_lookup(struct loop_device **l, int i)
 	/* lookup and return a specific i */
 	lo = idr_find(&loop_index_idr, i);
 	if (lo) {
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		if (!loopfs_same_instance(inode, lo->lo_loopfs_i))
+			return -EACCES;
+#endif
+
 		*l = lo;
 		ret = lo->lo_number;
 	}
@@ -2141,9 +2225,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
 	int err;
 
 	mutex_lock(&loop_ctl_mutex);
-	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
+	err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
-		err = loop_add(&lo, MINOR(dev) >> part_shift);
+		err = loop_add(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
 		kobj = NULL;
 	else
@@ -2167,15 +2251,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 	ret = -ENOSYS;
 	switch (cmd) {
 	case LOOP_CTL_ADD:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret >= 0) {
 			ret = -EEXIST;
 			break;
 		}
-		ret = loop_add(&lo, parm);
+		ret = loop_add(&lo, parm, file_inode(file));
 		break;
 	case LOOP_CTL_REMOVE:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret < 0)
 			break;
 		if (lo->lo_state != Lo_unbound) {
@@ -2191,10 +2275,10 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 		loop_remove(lo);
 		break;
 	case LOOP_CTL_GET_FREE:
-		ret = loop_lookup(&lo, -1);
+		ret = loop_lookup(&lo, -1, file_inode(file));
 		if (ret >= 0)
 			break;
-		ret = loop_add(&lo, -1);
+		ret = loop_add(&lo, -1, file_inode(file));
 	}
 	mutex_unlock(&loop_ctl_mutex);
 
@@ -2221,7 +2305,6 @@ MODULE_ALIAS("devname:loop-control");
 static int __init loop_init(void)
 {
 	int i, nr;
-	unsigned long range;
 	struct loop_device *lo;
 	int err;
 
@@ -2260,10 +2343,10 @@ static int __init loop_init(void)
 	 */
 	if (max_loop) {
 		nr = max_loop;
-		range = max_loop << part_shift;
+		max_devices = max_loop << part_shift;
 	} else {
 		nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT;
-		range = 1UL << MINORBITS;
+		max_devices = 1UL << MINORBITS;
 	}
 
 	err = misc_register(&loop_misc);
@@ -2276,13 +2359,13 @@ static int __init loop_init(void)
 		goto misc_out;
 	}
 
-	blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
+	blk_register_region(MKDEV(LOOP_MAJOR, 0), max_devices,
 				  THIS_MODULE, loop_probe, NULL, NULL);
 
 	/* pre-create number of devices given by config or max_loop */
 	mutex_lock(&loop_ctl_mutex);
 	for (i = 0; i < nr; i++)
-		loop_add(&lo, i);
+		loop_add(&lo, i, NULL);
 	mutex_unlock(&loop_ctl_mutex);
 
 	printk(KERN_INFO "loop: module loaded\n");
@@ -2304,14 +2387,10 @@ static int loop_exit_cb(int id, void *ptr, void *data)
 
 static void __exit loop_exit(void)
 {
-	unsigned long range;
-
-	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
-
 	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
 	idr_destroy(&loop_index_idr);
 
-	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
+	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), max_devices);
 	unregister_blkdev(LOOP_MAJOR, "loop");
 
 	misc_deregister(&loop_misc);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index af75a5ee4094..f740c7910023 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -62,6 +62,9 @@ struct loop_device {
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct inode		*lo_loopfs_i;
+#endif
 };
 
 struct loop_cmd {
@@ -89,6 +92,9 @@ struct loop_func_table {
 }; 
 
 int loop_register_transfer(struct loop_func_table *funcs);
-int loop_unregister_transfer(int number); 
+int loop_unregister_transfer(int number);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+extern unsigned long max_devices;
+#endif
 
 #endif
diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile
new file mode 100644
index 000000000000..87ec703b662e
--- /dev/null
+++ b/drivers/block/loopfs/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+loopfs-y			:= loopfs.o
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs.o
diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
new file mode 100644
index 000000000000..ac46aa337008
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.c
@@ -0,0 +1,429 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/fsnotify.h>
+#include <linux/genhd.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/magic.h>
+#include <linux/major.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include "../loop.h"
+#include "loopfs.h"
+
+#define FIRST_INODE 1
+#define SECOND_INODE 2
+#define INODE_OFFSET 3
+
+enum loopfs_param {
+	Opt_max,
+};
+
+const struct fs_parameter_spec loopfs_fs_parameters[] = {
+	fsparam_u32("max",	Opt_max),
+	{}
+};
+
+struct loopfs_mount_opts {
+	int max;
+};
+
+struct loopfs_info {
+	kuid_t root_uid;
+	kgid_t root_gid;
+	unsigned long device_count;
+	struct dentry *control_dentry;
+	struct loopfs_mount_opts mount_opts;
+};
+
+static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+/**
+ * loopfs_loop_device_create - allocate inode from super block of a loopfs mount
+ * @lo:		loop device for which we are creating a new device entry
+ * @ref_inode:	inode from wich the super block will be taken
+ * @device_nr:  device number of the associated disk device
+ *
+ * This function creates a new device node for @lo.
+ * Minor numbers are limited and tracked globally. The
+ * function will stash a struct loop_device for the specific loop
+ * device in i_private of the inode.
+ * It will go on to allocate a new inode from the super block of the
+ * filesystem mount, stash a struct loop_device in its i_private field
+ * and attach a dentry to that inode.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int loopfs_loop_device_create(struct loop_device *lo, struct inode *ref_inode,
+			      dev_t device_nr)
+{
+	char name[DISK_NAME_LEN];
+	struct super_block *sb;
+	struct loopfs_info *info;
+	struct dentry *root, *dentry;
+	struct inode *inode;
+
+	sb = loopfs_i_sb(ref_inode);
+	if (!sb)
+		return 0;
+
+	if (MAJOR(device_nr) != LOOP_MAJOR)
+		return -EINVAL;
+
+	info = LOOPFS_SB(sb);
+	if ((info->device_count + 1) > info->mount_opts.max)
+		return -ENOSPC;
+
+	if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name))
+		return -EINVAL;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop device in this loopfs instance is opened.
+	 */
+	inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+	init_special_inode(inode, S_IFBLK | 0600, device_nr);
+
+	root = sb->s_root;
+	inode_lock(d_inode(root));
+	/* look it up */
+	dentry = lookup_one_len(name, root, strlen(name));
+	if (IS_ERR(dentry)) {
+		inode_unlock(d_inode(root));
+		iput(inode);
+		return PTR_ERR(dentry);
+	}
+
+	if (d_really_is_positive(dentry)) {
+		/* already exists */
+		dput(dentry);
+		inode_unlock(d_inode(root));
+		iput(inode);
+		return -EEXIST;
+	}
+
+	d_instantiate(dentry, inode);
+	fsnotify_create(d_inode(root), dentry);
+	inode_unlock(d_inode(root));
+
+	inode->i_private = lo;
+	lo->lo_loopfs_i = inode;
+	info->device_count++;
+
+	return 0;
+}
+
+void loopfs_loop_device_remove(struct loop_device *lo)
+{
+	struct inode *inode = lo->lo_loopfs_i;
+	struct super_block *sb;
+	struct dentry *root, *dentry;
+
+	if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR)
+		return;
+
+	sb = loopfs_i_sb(inode);
+	if (!sb)
+		return;
+	lo->lo_loopfs_i = NULL;
+
+	/*
+	 * The root dentry is always the parent dentry since we don't allow
+	 * creation of directories.
+	 */
+	root = sb->s_root;
+
+	inode_lock(d_inode(root));
+	dentry = d_find_any_alias(inode);
+	if (dentry && simple_positive(dentry)) {
+		simple_unlink(d_inode(root), dentry);
+		d_delete(dentry);
+	}
+	dput(dentry);
+	inode_unlock(d_inode(root));
+	LOOPFS_SB(sb)->device_count--;
+}
+
+static void loopfs_fs_context_free(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+
+	kfree(ctx);
+}
+
+/**
+ * loopfs_loop_ctl_create - create a new loop-control device
+ * @sb: super block of the loopfs mount
+ *
+ * This function creates a new loop-control device node in the loopfs mount
+ * referred to by @sb.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static int loopfs_loop_ctl_create(struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct inode *inode = NULL;
+	struct dentry *root = sb->s_root;
+	struct loopfs_info *info = sb->s_fs_info;
+
+	if (info->control_dentry)
+		return 0;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = SECOND_INODE;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	init_special_inode(inode, S_IFCHR | 0600,
+			   MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR));
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop-control device in this loopfs instance is opened.
+	 */
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+
+	dentry = d_alloc_name(root, "loop-control");
+	if (!dentry) {
+		iput(inode);
+		return -ENOMEM;
+	}
+
+	info->control_dentry = dentry;
+	d_add(dentry, inode);
+
+	return 0;
+}
+
+static inline bool is_loopfs_control_device(const struct dentry *dentry)
+{
+	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
+}
+
+static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			 struct inode *new_dir, struct dentry *new_dentry,
+			 unsigned int flags)
+{
+	if (is_loopfs_control_device(old_dentry) ||
+	    is_loopfs_control_device(new_dentry))
+		return -EPERM;
+
+	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
+}
+
+static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int ret;
+	struct loop_device *lo;
+
+	if (is_loopfs_control_device(dentry))
+		return -EPERM;
+
+	lo = d_inode(dentry)->i_private;
+	ret = loopfs_rundown_locked(lo);
+	if (ret)
+		return ret;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations loopfs_dir_inode_operations = {
+	.lookup = simple_lookup,
+	.rename = loopfs_rename,
+	.unlink = loopfs_unlink,
+};
+
+static void loopfs_evict_inode(struct inode *inode)
+{
+	struct loop_device *lo = inode->i_private;
+
+	clear_inode(inode);
+
+	/* Only cleanup loop devices, not the loop control device. */
+	if (S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
+		loopfs_remove_locked(lo);
+		LOOPFS_SB(inode->i_sb)->device_count--;
+	}
+}
+
+static int loopfs_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct loopfs_info *info = LOOPFS_SB(root->d_sb);
+
+	if (info->mount_opts.max <= max_devices)
+		seq_printf(seq, ",max=%d", info->mount_opts.max);
+
+	return 0;
+}
+
+static void loopfs_put_super(struct super_block *sb)
+{
+	struct loopfs_info *info = sb->s_fs_info;
+
+	sb->s_fs_info = NULL;
+	kfree(info);
+}
+
+static const struct super_operations loopfs_super_ops = {
+	.evict_inode    = loopfs_evict_inode,
+	.show_options	= loopfs_show_options,
+	.statfs         = simple_statfs,
+	.put_super	= loopfs_put_super,
+};
+
+static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct loopfs_info *info;
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct inode *inode = NULL;
+
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+
+	/*
+	 * The loopfs filesystem can be mounted by userns root in a
+	 * non-initial userns. By default such mounts have the SB_I_NODEV flag
+	 * set in s_iflags to prevent security issues where userns root can
+	 * just create random device nodes via mknod() since it owns the
+	 * filesystem mount. But loopfs does not allow to create any files
+	 * including devices nodes. The only way to create binder devices nodes
+	 * is through the loop-control device which userns root is explicitly
+	 * allowed to do. So removing the SB_I_NODEV flag from s_iflags is both
+	 * necessary and safe.
+	 */
+	sb->s_iflags &= ~SB_I_NODEV;
+	sb->s_iflags |= SB_I_NOEXEC;
+	sb->s_magic = LOOPFS_SUPER_MAGIC;
+	sb->s_op = &loopfs_super_ops;
+	sb->s_time_gran = 1;
+
+	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
+	if (!sb->s_fs_info)
+		return -ENOMEM;
+	info = sb->s_fs_info;
+
+	info->root_gid = make_kgid(sb->s_user_ns, 0);
+	if (!gid_valid(info->root_gid))
+		info->root_gid = GLOBAL_ROOT_GID;
+	info->root_uid = make_kuid(sb->s_user_ns, 0);
+	if (!uid_valid(info->root_uid))
+		info->root_uid = GLOBAL_ROOT_UID;
+	info->mount_opts.max = ctx->max;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = FIRST_INODE;
+	inode->i_fop = &simple_dir_operations;
+	inode->i_mode = S_IFDIR | 0755;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_op = &loopfs_dir_inode_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return loopfs_loop_ctl_create(sb);
+}
+
+static int loopfs_fs_context_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, loopfs_fill_super);
+}
+
+static int loopfs_fs_context_parse_param(struct fs_context *fc,
+					 struct fs_parameter *param)
+{
+	int opt;
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct fs_parse_result result;
+
+	opt = fs_parse(fc, loopfs_fs_parameters, param, &result);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_max:
+		if (result.uint_32 > max_devices)
+			return invalfc(fc, "Bad value for '%s'", param->key);
+
+		ctx->max = result.uint_32;
+		break;
+	default:
+		return invalfc(fc, "Unsupported parameter '%s'", param->key);
+	}
+
+	return 0;
+}
+
+static int loopfs_fs_context_reconfigure(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct loopfs_info *info = LOOPFS_SB(fc->root->d_sb);
+
+	info->mount_opts.max = ctx->max;
+	return 0;
+}
+
+static const struct fs_context_operations loopfs_fs_context_ops = {
+	.free		= loopfs_fs_context_free,
+	.get_tree	= loopfs_fs_context_get_tree,
+	.parse_param	= loopfs_fs_context_parse_param,
+	.reconfigure	= loopfs_fs_context_reconfigure,
+};
+
+static int loopfs_init_fs_context(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+
+	ctx = kzalloc(sizeof(struct loopfs_mount_opts), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->max = max_devices;
+
+	fc->fs_private = ctx;
+
+	fc->ops = &loopfs_fs_context_ops;
+
+	return 0;
+}
+
+static struct file_system_type loop_fs_type = {
+	.name			= "loop",
+	.init_fs_context	= loopfs_init_fs_context,
+	.parameters		= loopfs_fs_parameters,
+	.kill_sb		= kill_litter_super,
+	.fs_flags		= FS_USERNS_MOUNT,
+};
+
+int __init init_loopfs(void)
+{
+	return register_filesystem(&loop_fs_type);
+}
+
+module_init(init_loopfs);
+MODULE_AUTHOR("Christian Brauner <christian.brauner@ubuntu.com>");
+MODULE_DESCRIPTION("Loop device filesystem");
diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
new file mode 100644
index 000000000000..b649f225eb8c
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_LOOPFS_FS_H
+#define _LINUX_LOOPFS_FS_H
+
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+
+struct loop_device;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+
+extern inline struct super_block *loopfs_i_sb(const struct inode *inode)
+{
+	if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC)
+		return inode->i_sb;
+
+	return NULL;
+}
+static inline bool loopfs_same_instance(const struct inode *first,
+					const struct inode *second)
+{
+	return loopfs_i_sb(first) == loopfs_i_sb(second);
+}
+extern int loopfs_loop_device_create(struct loop_device *lo,
+				     struct inode *ref_inode, dev_t device_nr);
+extern void loopfs_loop_device_remove(struct loop_device *lo);
+
+extern void loopfs_remove_locked(struct loop_device *lo);
+extern int loopfs_rundown_locked(struct loop_device *lo);
+
+#endif
+
+#endif /* _LINUX_LOOPFS_FS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..0817d093a012 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,6 +75,7 @@
 #define BINFMTFS_MAGIC          0x42494e4d
 #define DEVPTS_SUPER_MAGIC	0x1cd1
 #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
+#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define PIPEFS_MAGIC            0x50495045
 #define PROC_SUPER_MAGIC	0x9fa0
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 3/8] loop: use ns_capable for some loop operations
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
  2020-04-08 15:21 ` [PATCH 1/8] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
  2020-04-08 15:21 ` [PATCH 2/8] loopfs: implement loopfs Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-08 15:21 ` [PATCH 4/8] kernfs: handle multiple namespace tags Christian Brauner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

The following  LOOP_GET_STATUS, LOOP_SET_STATUS, and LOOP_SET_BLOCK_SIZE
operations are now allowed in non-initial namespaces. Most other
operations were already possible before.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 drivers/block/loop.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a7eda14be639..b1c3436d6e38 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1294,6 +1294,25 @@ void loopfs_remove_locked(struct loop_device *lo)
 	loop_remove(lo);
 	mutex_unlock(&loop_ctl_mutex);
 }
+
+static bool loop_capable(const struct loop_device *lo, int cap)
+{
+	struct super_block *sb;
+	struct user_namespace *ns;
+
+	sb = loopfs_i_sb(lo->lo_loopfs_i);
+	if (sb)
+		ns = sb->s_user_ns;
+	else
+		ns = &init_user_ns;
+
+	return ns_capable(ns, cap);
+}
+#else /* !CONFIG_BLK_DEV_LOOPFS */
+static inline bool loop_capable(const struct loop_device *lo, int cap)
+{
+	return capable(cap);
+}
 #endif /* CONFIG_BLK_DEV_LOOPFS */
 
 static int
@@ -1310,7 +1329,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return err;
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
-	    !capable(CAP_SYS_ADMIN)) {
+	    !loop_capable(lo, CAP_SYS_ADMIN)) {
 		err = -EPERM;
 		goto out_unlock;
 	}
@@ -1441,7 +1460,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
 	info->lo_encrypt_type =
 		lo->lo_encryption ? lo->lo_encryption->number : 0;
-	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+	if (lo->lo_encrypt_key_size && loop_capable(lo, CAP_SYS_ADMIN)) {
 		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
 		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
 		       lo->lo_encrypt_key_size);
@@ -1665,7 +1684,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_clr_fd(lo);
 	case LOOP_SET_STATUS:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status_old(lo,
 					(struct loop_info __user *)arg);
 		}
@@ -1674,7 +1693,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_get_status_old(lo, (struct loop_info __user *) arg);
 	case LOOP_SET_STATUS64:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status64(lo,
 					(struct loop_info64 __user *) arg);
 		}
@@ -1684,7 +1703,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 	case LOOP_SET_CAPACITY:
 	case LOOP_SET_DIRECT_IO:
 	case LOOP_SET_BLOCK_SIZE:
-		if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
+		if (!(mode & FMODE_WRITE) && !loop_capable(lo, CAP_SYS_ADMIN))
 			return -EPERM;
 		/* Fall through */
 	default:
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 4/8] kernfs: handle multiple namespace tags
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (2 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 3/8] loop: use ns_capable for some loop operations Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-13 18:46   ` Tejun Heo
  2020-04-08 15:21 ` [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace Christian Brauner
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Since [1] kernfs supports namespace tags. This feature is essential to
enable sysfs to present different views of on various parts depending on
the namespace tag. For example, the /sys/class/net/ directory will only
show network devices that belong to the network namespace that sysfs was
mounted in. This is achieved by stashing a reference to the network
namespace of the task mounting sysfs in the super block. And when a
lookup operation is performed on e.g. /sys/class/net/ kernfs will
compare the network namespace tag of the kernfs_node associated with the
device and kobject of the network device to the network namespace of the
network device. This ensures that only network devices owned by the
network namespace sysfs was mounted in are shown, a feature which is
essential to containers.
For loopfs to show correct permissions in sysfs just as with network
devices we need to be able to tag kernfs_super_info with additional
namespaces. This extension was even already mentioned in a comment to
struct kernfs_super_info:
  /*
   * Each sb is associated with one namespace tag, currently the
   * network namespace of the task which mounted this kernfs
   * instance.  If multiple tags become necessary, make the following
   * an array and compare kernfs_node tag against every entry.
   */
This patch extends the kernfs_super_info and kernfs_fs_context ns
pointers to fixed-size arrays of namespace tags. The size is taken from
the namespaces currently supported by kobjects, i.e. we don't extend it
to cover all namespace but only the ones kernfs needs to support.
In addition, the kernfs_node struct gains an additional member that
indicates the type of namespace this kernfs_node was tagged with. This
allows us to simply retrieve the correct namespace tag from the
kernfs_fs_context and kernfs_super_info ns array with a simple indexing
operation. This has the advantage that we can just keep passing down the
correct namespace instead of passing down the array.

[1]: 608b4b9548de ("netns: Teach network device kobjects which namespace they are in.")
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/kernfs/dir.c             |  6 +++---
 fs/kernfs/kernfs-internal.h |  9 ++++-----
 fs/kernfs/mount.c           | 11 +++++++----
 fs/sysfs/mount.c            | 10 +++++-----
 include/linux/kernfs.h      | 22 ++++++++++++++--------
 include/linux/sysfs.h       |  8 +++++---
 lib/kobject.c               |  2 +-
 7 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..1f2d894ae454 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -576,7 +576,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 
 	/* The kernfs node has been moved to a different namespace */
 	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns != kn->ns)
+	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
 		goto out_bad;
 
 	mutex_unlock(&kernfs_mutex);
@@ -1087,7 +1087,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dir->i_sb)->ns;
+		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
 
@@ -1673,7 +1673,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dentry->d_sb)->ns;
+		ns = kernfs_info(dentry->d_sb)->ns[parent->ns_type];
 
 	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
 	     pos;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2f3c51d55261..6c375eb59460 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -16,6 +16,7 @@
 #include <linux/xattr.h>
 
 #include <linux/kernfs.h>
+#include <linux/kobject_ns.h>
 #include <linux/fs_context.h>
 
 struct kernfs_iattrs {
@@ -60,12 +61,10 @@ struct kernfs_super_info {
 	struct kernfs_root	*root;
 
 	/*
-	 * Each sb is associated with one namespace tag, currently the
-	 * network namespace of the task which mounted this kernfs
-	 * instance.  If multiple tags become necessary, make the following
-	 * an array and compare kernfs_node tag against every entry.
+	 * Each sb can be associated with namespace tags. They will be used
+	 * to compare kernfs_node tags against relevant entries.
 	 */
-	const void		*ns;
+	const void		*ns[KOBJ_NS_TYPES];
 
 	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
 	struct list_head	node;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10..dc4ee0f0a597 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -279,14 +279,15 @@ static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
 	struct kernfs_super_info *sb_info = kernfs_info(sb);
 	struct kernfs_super_info *info = fc->s_fs_info;
 
-	return sb_info->root == info->root && sb_info->ns == info->ns;
+	return sb_info->root == info->root &&
+	       memcmp(sb_info->ns, info->ns, sizeof(sb_info->ns)) == 0;
 }
 
 static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	kfc->ns_tag = NULL;
+	memset(kfc->ns_tag, 0, sizeof(kfc->ns_tag));
 	return set_anon_super_fc(sb, fc);
 }
 
@@ -296,7 +297,7 @@ static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
  *
  * Return the namespace tag associated with kernfs super_block @sb.
  */
-const void *kernfs_super_ns(struct super_block *sb)
+const void **kernfs_super_ns(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 
@@ -324,7 +325,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		return -ENOMEM;
 
 	info->root = kfc->root;
-	info->ns = kfc->ns_tag;
+	BUILD_BUG_ON(sizeof(info->ns) != sizeof(kfc->ns_tag));
+	memcpy(info->ns, kfc->ns_tag, sizeof(info->ns));
+
 	INIT_LIST_HEAD(&info->node);
 
 	fc->s_fs_info = info;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index db81cfbab9d6..5e2ec88a709e 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -41,8 +41,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	if (kfc->ns_tag)
-		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
+		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -66,7 +66,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 	if (!kfc)
 		return -ENOMEM;
 
-	kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -81,10 +81,10 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 
 static void sysfs_kill_sb(struct super_block *sb)
 {
-	void *ns = (void *)kernfs_super_ns(sb);
+	void **ns = (void **)kernfs_super_ns(sb);
 
 	kernfs_kill_sb(sb);
-	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
+	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
 }
 
 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index dded2e5a9f42..0e4414bd7007 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
 #include <linux/atomic.h>
 #include <linux/uidgid.h>
 #include <linux/wait.h>
+#include <linux/kobject_ns.h>
 
 struct file;
 struct dentry;
@@ -130,8 +131,9 @@ struct kernfs_node {
 
 	struct rb_node		rb;
 
-	const void		*ns;	/* namespace tag */
-	unsigned int		hash;	/* ns + name hash */
+	const void		*ns;		/* namespace tag */
+	enum kobj_ns_type	ns_type;	/* type of namespace tag */
+	unsigned int		hash;		/* ns + name hash */
 	union {
 		struct kernfs_elem_dir		dir;
 		struct kernfs_elem_symlink	symlink;
@@ -268,7 +270,7 @@ struct kernfs_ops {
  */
 struct kernfs_fs_context {
 	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
-	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
+	void			*ns_tag[KOBJ_NS_TYPES]; /* Namespace tags of the mount (or empty) */
 	unsigned long		magic;		/* File system specific magic number */
 
 	/* The following are set/used by kernfs_mount() */
@@ -312,17 +314,20 @@ static inline ino_t kernfs_gen(struct kernfs_node *kn)
 
 /**
  * kernfs_enable_ns - enable namespace under a directory
- * @kn: directory of interest, should be empty
+ * @kn:		directory of interest, should be empty
+ * @ns_type:	type of namespace that should be enabled for this directory
  *
  * This is to be called right after @kn is created to enable namespace
  * under it.  All children of @kn must have non-NULL namespace tags and
  * only the ones which match the super_block's tag will be visible.
  */
-static inline void kernfs_enable_ns(struct kernfs_node *kn)
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type)
 {
 	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
 	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
 	kn->flags |= KERNFS_NS;
+	kn->ns_type = ns_type;
 }
 
 /**
@@ -394,7 +399,7 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 		     const void *value, size_t size, int flags);
 
-const void *kernfs_super_ns(struct super_block *sb);
+const void **kernfs_super_ns(struct super_block *sb);
 int kernfs_get_tree(struct fs_context *fc);
 void kernfs_free_fs_context(struct fs_context *fc);
 void kernfs_kill_sb(struct super_block *sb);
@@ -408,7 +413,8 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
 { return 0; }	/* whatever */
 
-static inline void kernfs_enable_ns(struct kernfs_node *kn) { }
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type) { }
 
 static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 { return false; }
@@ -504,7 +510,7 @@ static inline int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 				   const void *value, size_t size, int flags)
 { return -ENOSYS; }
 
-static inline const void *kernfs_super_ns(struct super_block *sb)
+static inline const void **kernfs_super_ns(struct super_block *sb)
 { return NULL; }
 
 static inline int kernfs_get_tree(struct fs_context *fc)
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index fa7ee503fb76..52eda7159f09 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -305,9 +305,10 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
 
 int __must_check sysfs_init(void);
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
-	return kernfs_enable_ns(kn);
+	return kernfs_enable_ns(kn, ns_type);
 }
 
 #else /* CONFIG_SYSFS */
@@ -518,7 +519,8 @@ static inline int __must_check sysfs_init(void)
 	return 0;
 }
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
 }
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 6f07083cc111..c58c62d49a10 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -120,7 +120,7 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(ops->type >= KOBJ_NS_TYPES);
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
-		sysfs_enable_ns(kobj->sd);
+		sysfs_enable_ns(kobj->sd, ops->type);
 	}
 
 	return 0;
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (3 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 4/8] kernfs: handle multiple namespace tags Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-13 19:02   ` Tejun Heo
  2020-04-08 15:21 ` [PATCH 6/8] genhd: add minimal namespace infrastructure Christian Brauner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

The initial namespace is special in many ways. One feature it always has
had is that it propagates all its devices into all non-initial
namespaces. This is e.g. true for all device classes under /sys/class/
except the net_class. Even though none of the propagated files can be
used there are still a lot of read-only values that are accessed or read
by tools running in non-initial namespaces. To not regress such
workloads we introduce the ability to tell kernfs to continue
propagating devices from the initial namespace even when the kernfs_node
is tagged with a non-initial namespace. Note that this is a purely
opt-in feature, i.e. if there were a new device class that wanted to
make use of this new infrastructure and did not want to propagate any
devices into non-initial namespaces it could simply not implement the
relevant callback.
When a new directory in sysfs is created sysfs now can simply check
whether the relevant device wants to propagate objects from the initial
namespace or not.

Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/kernfs/dir.c             | 34 +++++++++++++++++++++++++++++-----
 fs/kernfs/kernfs-internal.h | 14 ++++++++++++++
 include/linux/kernfs.h      | 22 ++++++++++++++++++++++
 include/linux/kobject_ns.h  |  3 +++
 lib/kobject.c               |  2 ++
 5 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1f2d894ae454..02796ba6521a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
-		goto out_bad;
+	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
+		if (kernfs_init_ns_propagates(kn->parent) &&
+		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
+			goto out_good;
+		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
+			goto out_bad;
+	}
 
+out_good:
 	mutex_unlock(&kernfs_mutex);
 	return 1;
 out_bad:
@@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	if (!kn && kernfs_init_ns_propagates(parent)) {
+		ns = kernfs_init_ns(parent->ns_type);
+		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	}
 
 	/* no such entry */
 	if (!kn || !kernfs_active(kn)) {
@@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+	const void *init_ns;
+
 	if (pos) {
 		int valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
@@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		if (!valid)
 			pos = NULL;
 	}
+
+	if (kernfs_init_ns_propagates(parent))
+		init_ns = kernfs_init_ns(parent->ns_type);
+	else
+		init_ns = NULL;
+
 	if (!pos && (hash > 1) && (hash < INT_MAX)) {
 		struct rb_node *node = parent->dir.children.rb_node;
 		while (node) {
@@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 {
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
 	if (pos) {
+		const void *init_ns;
+		if (kernfs_init_ns_propagates(parent))
+			init_ns = kernfs_init_ns(parent->ns_type);
+		else
+			init_ns = NULL;
+
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
 			if (!node)
 				pos = NULL;
 			else
 				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
+		} while (pos && (!kernfs_active(pos) ||
+				 (pos->ns != ns && pos->ns != init_ns)));
 	}
 	return pos;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 6c375eb59460..4ba7b36103de 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -78,6 +78,20 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 	return d_inode(dentry)->i_private;
 }
 
+extern struct net init_net;
+
+static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
+{
+	switch (ns_type) {
+	case KOBJ_NS_TYPE_NET:
+		return &init_net;
+	default:
+		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
+	}
+
+	return NULL;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 0e4414bd7007..5e2143e69c1c 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -51,6 +51,7 @@ enum kernfs_node_flag {
 	KERNFS_SUICIDED		= 0x0800,
 	KERNFS_EMPTY_DIR	= 0x1000,
 	KERNFS_HAS_RELEASE	= 0x2000,
+	KERNFS_NS_PROPAGATE	= 0x4000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -330,6 +331,27 @@ static inline void kernfs_enable_ns(struct kernfs_node *kn,
 	kn->ns_type = ns_type;
 }
 
+static inline void kernfs_enable_init_ns_propagates(struct kernfs_node *kn)
+{
+	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
+	WARN_ON_ONCE(!(kn->flags & KERNFS_NS));
+	kn->flags |= KERNFS_NS_PROPAGATE;
+}
+
+/**
+ * kernfs_init_ns_propagates - test whether init ns propagates
+ * @kn: the node to test
+ *
+ * Test whether kernfs entries created in the init namespace propagate into
+ * other namespaces.
+ */
+static inline bool kernfs_init_ns_propagates(const struct kernfs_node *kn)
+{
+	return ((kn->flags & (KERNFS_NS | KERNFS_NS_PROPAGATE)) ==
+		(KERNFS_NS | KERNFS_NS_PROPAGATE));
+}
+
 /**
  * kernfs_ns_enabled - test whether namespace is enabled
  * @kn: the node to test
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 991a9286bcea..216f9112ee1d 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -34,6 +34,8 @@ enum kobj_ns_type {
  *   @grab_current_ns: return a new reference to calling task's namespace
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
+ *   @initial_ns_propagates: whether devices in the initial namespace propagate
+ *			to all other namespaces
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
@@ -41,6 +43,7 @@ struct kobj_ns_type_operations {
 	void *(*grab_current_ns)(void);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
+	bool (*initial_ns_propagates)(void);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
diff --git a/lib/kobject.c b/lib/kobject.c
index c58c62d49a10..96bb8c732d1c 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -121,6 +121,8 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
 		sysfs_enable_ns(kobj->sd, ops->type);
+		if (ops->initial_ns_propagates && ops->initial_ns_propagates())
+			kernfs_enable_init_ns_propagates(kobj->sd);
 	}
 
 	return 0;
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 6/8] genhd: add minimal namespace infrastructure
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (4 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-13 19:04   ` Tejun Heo
  2020-04-08 15:21 ` [PATCH 7/8] loopfs: start attaching correct namespace during loop_add() Christian Brauner
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

This lets the block_class properly support loopfs device by introducing
the minimal infrastructure needed to support different sysfs views for
devices belonging to the block_class. This is similar to how network
devices work. Note, that nothing changes with this patch since
all block_class devices are tagged explicitly with init_user_ns whereas
they were tagged implicitly with init_user_ns before. No code is added
if CONFIG_BLK_DEV_LOOPFS is not set.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 block/genhd.c               | 79 +++++++++++++++++++++++++++++++++++++
 fs/kernfs/kernfs-internal.h |  3 ++
 fs/sysfs/mount.c            |  4 ++
 include/linux/genhd.h       |  3 ++
 include/linux/kobject_ns.h  |  1 +
 5 files changed, 90 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 9c2e13ce0d19..a6d51d9a94f6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1127,11 +1127,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
 	return NULL;
 }
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+static void *user_grab_current_ns(void)
+{
+	struct user_namespace *ns = current_user_ns();
+	return get_user_ns(ns);
+}
+
+static const void *user_initial_ns(void)
+{
+	return &init_user_ns;
+}
+
+static void user_put_ns(void *p)
+{
+	struct user_namespace *ns = p;
+	put_user_ns(ns);
+}
+
+static bool user_current_may_mount(void)
+{
+	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
+}
+
+const struct kobj_ns_type_operations user_ns_type_operations = {
+	.type			= KOBJ_NS_TYPE_USER,
+	.current_may_mount	= user_current_may_mount,
+	.grab_current_ns	= user_grab_current_ns,
+	.initial_ns		= user_initial_ns,
+	.drop_ns		= user_put_ns,
+};
+
+static const void *block_class_user_namespace(struct device *dev)
+{
+	struct gendisk *disk;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	return disk->user_ns;
+}
+
+static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
+{
+	struct gendisk *disk;
+	struct user_namespace *ns;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	ns = disk->user_ns;
+	if (ns && ns != &init_user_ns) {
+		kuid_t ns_root_uid = make_kuid(ns, 0);
+		kgid_t ns_root_gid = make_kgid(ns, 0);
+
+		if (uid_valid(ns_root_uid))
+			*uid = ns_root_uid;
+
+		if (gid_valid(ns_root_gid))
+			*gid = ns_root_gid;
+	}
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int __init genhd_device_init(void)
 {
 	int error;
 
 	block_class.dev_kobj = sysfs_dev_block_kobj;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	kobj_ns_type_register(&user_ns_type_operations);
+#endif
 	error = class_register(&block_class);
 	if (unlikely(error))
 		return error;
@@ -1369,8 +1439,14 @@ static void disk_release(struct device *dev)
 		blk_put_queue(disk->queue);
 	kfree(disk);
 }
+
 struct class block_class = {
 	.name		= "block",
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	.ns_type	= &user_ns_type_operations,
+	.namespace	= block_class_user_namespace,
+	.get_ownership	= block_class_get_ownership,
+#endif
 };
 
 static char *block_devnode(struct device *dev, umode_t *mode,
@@ -1550,6 +1626,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 		disk_to_dev(disk)->class = &block_class;
 		disk_to_dev(disk)->type = &disk_type;
 		device_initialize(disk_to_dev(disk));
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		disk->user_ns = &init_user_ns;
+#endif
 	}
 	return disk;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 4ba7b36103de..699b7b67f9e0 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -79,12 +79,15 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 }
 
 extern struct net init_net;
+extern struct user_namespace init_user_ns;
 
 static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
 {
 	switch (ns_type) {
 	case KOBJ_NS_TYPE_NET:
 		return &init_net;
+	case KOBJ_NS_TYPE_USER:
+		return &init_user_ns;
 	default:
 		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
 	}
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 5e2ec88a709e..99b82a0ae7ea 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 
 	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
+		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -85,6 +88,7 @@ static void sysfs_kill_sb(struct super_block *sb)
 
 	kernfs_kill_sb(sb);
 	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
+	kobj_ns_drop(KOBJ_NS_TYPE_USER, ns[KOBJ_NS_TYPE_USER]);
 }
 
 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 07dc91835b98..e5cf5caea345 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -219,6 +219,9 @@ struct gendisk {
 	int node_id;
 	struct badblocks *bb;
 	struct lockdep_map lockdep_map;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct user_namespace *user_ns;
+#endif
 };
 
 static inline struct gendisk *part_to_disk(struct hd_struct *part)
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 216f9112ee1d..a9c45bcce235 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -26,6 +26,7 @@ struct kobject;
 enum kobj_ns_type {
 	KOBJ_NS_TYPE_NONE = 0,
 	KOBJ_NS_TYPE_NET,
+	KOBJ_NS_TYPE_USER,
 	KOBJ_NS_TYPES
 };
 
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 7/8] loopfs: start attaching correct namespace during loop_add()
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (5 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 6/8] genhd: add minimal namespace infrastructure Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-08 15:21 ` [PATCH 8/8] loopfs: only show devices in their correct instance Christian Brauner
  2020-04-08 16:24 ` [PATCH 0/8] loopfs Jann Horn
  8 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Now that kernfs and the block_class optionally support setting user
namespace tags we can start tagging loop devices with the namespace the
loopfs instance was mounted in. This has the consequence that loopfs
devices carry the correct sysfs permissions for all their core files.
All other classes will continue to be correctly owned by the initial
namespaces. Here is sample output:

root@b1:~# mount -t loop loop /mnt
root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
root@b1:~# losetup -f
/dev/loop8
root@b1:~# ln -sf /mnt/loop8 /dev/loop8
root@b1:~# ls -al /sys/class/block/loop8
lrwxrwxrwx 1 root root 0 Apr  7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
root@b1:~# ls -al /sys/class/block/loop8/
total 0
drwxr-xr-x  9 root   root       0 Apr  7 13:06 .
drwxr-xr-x 18 nobody nogroup    0 Apr  7 13:07 ..
-r--r--r--  1 root   root    4096 Apr  7 13:06 alignment_offset
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 bdi -> ../../bdi/7:8
-r--r--r--  1 root   root    4096 Apr  7 13:06 capability
-r--r--r--  1 root   root    4096 Apr  7 13:06 dev
-r--r--r--  1 root   root    4096 Apr  7 13:06 discard_alignment
-r--r--r--  1 root   root    4096 Apr  7 13:06 events
-r--r--r--  1 root   root    4096 Apr  7 13:06 events_async
-rw-r--r--  1 root   root    4096 Apr  7 13:06 events_poll_msecs
-r--r--r--  1 root   root    4096 Apr  7 13:06 ext_range
-r--r--r--  1 root   root    4096 Apr  7 13:06 hidden
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 holders
-r--r--r--  1 root   root    4096 Apr  7 13:06 inflight
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 integrity
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 mq
drwxr-xr-x  2 root   root       0 Apr  7 13:06 power
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 queue
-r--r--r--  1 root   root    4096 Apr  7 13:06 range
-r--r--r--  1 root   root    4096 Apr  7 13:06 removable
-r--r--r--  1 root   root    4096 Apr  7 13:06 ro
-r--r--r--  1 root   root    4096 Apr  7 13:06 size
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 slaves
-r--r--r--  1 root   root    4096 Apr  7 13:06 stat
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 subsystem -> ../../../../class/block
drwxr-xr-x  2 root   root       0 Apr  7 13:06 trace
-rw-r--r--  1 root   root    4096 Apr  7 13:06 uevent
root@b1:~#

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 drivers/block/loop.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b1c3436d6e38..7a14fd3e4329 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2154,6 +2154,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (loopfs_i_sb(inode))
+		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+#endif
 
 	add_disk(disk);
 
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 8/8] loopfs: only show devices in their correct instance
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (6 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 7/8] loopfs: start attaching correct namespace during loop_add() Christian Brauner
@ 2020-04-08 15:21 ` Christian Brauner
  2020-04-08 16:24 ` [PATCH 0/8] loopfs Jann Horn
  8 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-08 15:21 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Since loopfs devices belong to a loopfs instance they have no business
polluting the host's devtmpfs mount and should not propagate out of the
namespace they belong to.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 drivers/base/devtmpfs.c | 4 ++--
 drivers/block/loop.c    | 4 +++-
 include/linux/device.h  | 3 +++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..77371ceb88fa 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -111,7 +111,7 @@ int devtmpfs_create_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.mode = 0;
@@ -138,7 +138,7 @@ int devtmpfs_delete_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7a14fd3e4329..df75ca4ac040 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2155,8 +2155,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
 #ifdef CONFIG_BLK_DEV_LOOPFS
-	if (loopfs_i_sb(inode))
+	if (loopfs_i_sb(inode)) {
 		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+		disk_to_dev(disk)->no_devnode = true;
+	}
 #endif
 
 	add_disk(disk);
diff --git a/include/linux/device.h b/include/linux/device.h
index fa04dfd22bbc..9fa438e3e4ca 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -525,6 +525,8 @@ struct dev_links_info {
  *		  sync_state() callback.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @no_devnode: whether device nodes associated with this device are kept out
+ *		of devtmpfs (e.g. due to separate filesystem)
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -625,6 +627,7 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+	bool			no_devnode:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
-- 
2.26.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/8] loopfs
  2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
                   ` (7 preceding siblings ...)
  2020-04-08 15:21 ` [PATCH 8/8] loopfs: only show devices in their correct instance Christian Brauner
@ 2020-04-08 16:24 ` Jann Horn
  2020-04-08 16:41   ` Stéphane Graber
  8 siblings, 1 reply; 29+ messages in thread
From: Jann Horn @ 2020-04-08 16:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, kernel list, linux-block,
	Linux API, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, Network Development, Matthew Garrett, linux-fsdevel

On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> One of the use-cases for loopfs is to allow to dynamically allocate loop
> devices in sandboxed workloads without exposing /dev or
> /dev/loop-control to the workload in question and without having to
> implement a complex and also racy protocol to send around file
> descriptors for loop devices. With loopfs each mount is a new instance,
> i.e. loop devices created in one loopfs instance are independent of any
> loop devices created in another loopfs instance. This allows
> sufficiently privileged tools to have their own private stash of loop
> device instances. Dmitry has expressed his desire to use this for
> syzkaller in a private discussion. And various parties that want to use
> it are Cced here too.
>
> In addition, the loopfs filesystem can be mounted by user namespace root
> and is thus suitable for use in containers. Combined with syscall
> interception this makes it possible to securely delegate mounting of
> images on loop devices, i.e. when a user calls mount -o loop <image>
> <mountpoint> it will be possible to completely setup the loop device.
> The final mount syscall to actually perform the mount will be handled
> through syscall interception and be performed by a sufficiently
> privileged process. Syscall interception is already supported through a
> new seccomp feature we implemented in [1] and extended in [2] and is
> actively used in production workloads. The additional loopfs work will
> be used there and in various other workloads too. You'll find a short
> illustration how this works with syscall interception below in [4].

Would that privileged process then allow you to mount your filesystem
images with things like ext4? As far as I know, the filesystem
maintainers don't generally consider "untrusted filesystem image" to
be a strongly enforced security boundary; and worse, if an attacker
has access to a loop device from which something like ext4 is mounted,
things like "struct ext4_dir_entry_2" will effectively be in shared
memory, and an attacker can trivially bypass e.g.
ext4_check_dir_entry(). At the moment, that's not a huge problem (for
anything other than kernel lockdown) because only root normally has
access to loop devices.

Ubuntu carries an out-of-tree patch that afaik blocks the shared
memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>

But even with that patch, I'm not super excited about exposing
filesystem image parsing attack surface to containers unless you run
the filesystem in a sandboxed environment (at which point you don't
need a loop device anymore either).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/8] loopfs
  2020-04-08 16:24 ` [PATCH 0/8] loopfs Jann Horn
@ 2020-04-08 16:41   ` Stéphane Graber
  2020-04-09  7:02     ` Dmitry Vyukov
  0 siblings, 1 reply; 29+ messages in thread
From: Stéphane Graber @ 2020-04-08 16:41 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Jens Axboe, Greg Kroah-Hartman, kernel list,
	linux-block, Linux API, Jonathan Corbet, Serge Hallyn,
	Rafael J. Wysocki, Tejun Heo, David S. Miller, Saravana Kannan,
	Jan Kara, David Howells, Seth Forshee, David Rheinsberg,
	Tom Gundersen, Christian Kellner, Dmitry Vyukov, linux-doc,
	Network Development, Matthew Garrett, linux-fsdevel

On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances. Dmitry has expressed his desire to use this for
> > syzkaller in a private discussion. And various parties that want to use
> > it are Cced here too.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a user calls mount -o loop <image>
> > <mountpoint> it will be possible to completely setup the loop device.
> > The final mount syscall to actually perform the mount will be handled
> > through syscall interception and be performed by a sufficiently
> > privileged process. Syscall interception is already supported through a
> > new seccomp feature we implemented in [1] and extended in [2] and is
> > actively used in production workloads. The additional loopfs work will
> > be used there and in various other workloads too. You'll find a short
> > illustration how this works with syscall interception below in [4].
>
> Would that privileged process then allow you to mount your filesystem
> images with things like ext4? As far as I know, the filesystem
> maintainers don't generally consider "untrusted filesystem image" to
> be a strongly enforced security boundary; and worse, if an attacker
> has access to a loop device from which something like ext4 is mounted,
> things like "struct ext4_dir_entry_2" will effectively be in shared
> memory, and an attacker can trivially bypass e.g.
> ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> anything other than kernel lockdown) because only root normally has
> access to loop devices.
>
> Ubuntu carries an out-of-tree patch that afaik blocks the shared
> memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
>
> But even with that patch, I'm not super excited about exposing
> filesystem image parsing attack surface to containers unless you run
> the filesystem in a sandboxed environment (at which point you don't
> need a loop device anymore either).

So in general we certainly agree that you should never expose someone
that you wouldn't trust with root on the host to syscall interception
mounting of real kernel filesystems.

But that's not all that our syscall interception logic can do. We have
support for rewriting a normal filesystem mount attempt to instead use
an available FUSE implementation. As far as the user is concerned,
they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
on /mnt as requested, except that the container manager intercepted
the mount attempt and instead spawned fuse2fs for that mount. This
requires absolutely no change to the software the user is running.

loopfs, with that interception mode, will let us also handle all cases
where a loop would be used, similarly without needing any change to
the software being run. If a piece of software calls the command
"mount -o loop blah.img /mnt", the "mount" command will setup a loop
device as it normally would (doing so through loopfs) and then will
call the "mount" syscall, which will get intercepted and redirected to
a FUSE implementation if so configured, resulting in the expected
filesystem being mounted for the user.

LXD with syscall interception offers both straight up privileged
mounting using the kernel fs or using a FUSE based implementation.
This is configurable on a per-filesystem and per-container basis.

I hope that clarifies what we're doing here :)

Stéphane

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-08 15:21 ` [PATCH 2/8] loopfs: implement loopfs Christian Brauner
@ 2020-04-09  5:39   ` David Rheinsberg
  2020-04-09  8:26     ` Christian Brauner
  2020-04-09  7:53   ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: David Rheinsberg @ 2020-04-09  5:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

Hi

On Wed, Apr 8, 2020 at 5:27 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> This implements loopfs, a loop device filesystem. It takes inspiration
> from the binderfs filesystem I implemented about two years ago and with
> which we had overally good experiences so far. Parts of it are also
> based on [3] but it's mostly a new, imho cleaner approach.
>
> One of the use-cases for loopfs is to allow to dynamically allocate loop
> devices in sandboxed workloads without exposing /dev or
> /dev/loop-control to the workload in question and without having to
> implement a complex and also racy protocol to send around file
> descriptors for loop devices. With loopfs each mount is a new instance,
> i.e. loop devices created in one loopfs instance are independent of any
> loop devices created in another loopfs instance. This allows
> sufficiently privileged tools to have their own private stash of loop
> device instances.
>
> In addition, the loopfs filesystem can be mounted by user namespace root
> and is thus suitable for use in containers. Combined with syscall
> interception this makes it possible to securely delegate mounting of
> images on loop devices, i.e. when a users calls mount -o loop <image>
> <mountpoint> it will be possible to completely setup the loop device
> (enabled in later patches) and the mount syscall to actually perform the
> mount will be handled through syscall interception and be performed by a
> sufficiently privileged process. Syscall interception is already
> supported through a new seccomp feature we implemented in [1] and
> extended in [2] and is actively used in production workloads. The
> additional loopfs work will be used there and in various other workloads
> too.
>
> The number of loop devices available to a loopfs instance can be limited
> by setting the "max" mount option to a positive integer. This e.g.
> allows sufficiently privileged processes to dynamically enforce a limit
> on the number of devices. This limit is dynamic in contrast to the
> max_loop module option in that a sufficiently privileged process can
> update it with a simple remount operation.
>
> The loopfs filesystem is placed under a new config option and special
> care has been taken to not introduce any new code when users do not
> select this config option.
>
> Note that in __loop_clr_fd() we now need not just check whether bdev is
> valid but also whether bdev->bd_disk is valid. This wasn't necessary
> before because in order to call LOOP_CLR_FD the loop device would need
> to be open and thus bdev->bd_disk was guaranteed to be allocated. For
> loopfs loop devices we allow callers to simply unlink them just as we do
> for binderfs binder devices and we do also need to account for the case
> where a loopfs superblock is shutdown while backing files might still be
> associated with some loop devices. In such cases no bd_disk device will
> be attached to bdev. This is not in itself noteworthy it's more about
> documenting the "why" of the added bdev->bd_disk check for posterity.
>
> [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Seth Forshee <seth.forshee@canonical.com>
> Cc: Tom Gundersen <teg@jklm.no>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christian Kellner <ckellner@redhat.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Rheinsberg <david.rheinsberg@gmail.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  MAINTAINERS                   |   5 +
>  drivers/block/Kconfig         |   4 +
>  drivers/block/Makefile        |   1 +
>  drivers/block/loop.c          | 151 +++++++++---
>  drivers/block/loop.h          |   8 +-
>  drivers/block/loopfs/Makefile |   3 +
>  drivers/block/loopfs/loopfs.c | 429 ++++++++++++++++++++++++++++++++++
>  drivers/block/loopfs/loopfs.h |  35 +++
>  include/uapi/linux/magic.h    |   1 +
>  9 files changed, 600 insertions(+), 37 deletions(-)
>  create mode 100644 drivers/block/loopfs/Makefile
>  create mode 100644 drivers/block/loopfs/loopfs.c
>  create mode 100644 drivers/block/loopfs/loopfs.h
>
[...]
> diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
> new file mode 100644
> index 000000000000..ac46aa337008
> --- /dev/null
> +++ b/drivers/block/loopfs/loopfs.c
> @@ -0,0 +1,429 @@
[...]
> +/**
> + * loopfs_loop_device_create - allocate inode from super block of a loopfs mount
> + * @lo:                loop device for which we are creating a new device entry
> + * @ref_inode: inode from wich the super block will be taken
> + * @device_nr:  device number of the associated disk device
> + *
> + * This function creates a new device node for @lo.
> + * Minor numbers are limited and tracked globally. The
> + * function will stash a struct loop_device for the specific loop
> + * device in i_private of the inode.
> + * It will go on to allocate a new inode from the super block of the
> + * filesystem mount, stash a struct loop_device in its i_private field
> + * and attach a dentry to that inode.
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +int loopfs_loop_device_create(struct loop_device *lo, struct inode *ref_inode,
> +                             dev_t device_nr)
> +{
> +       char name[DISK_NAME_LEN];
> +       struct super_block *sb;
> +       struct loopfs_info *info;
> +       struct dentry *root, *dentry;
> +       struct inode *inode;
> +
> +       sb = loopfs_i_sb(ref_inode);
> +       if (!sb)
> +               return 0;
> +
> +       if (MAJOR(device_nr) != LOOP_MAJOR)
> +               return -EINVAL;
> +
> +       info = LOOPFS_SB(sb);
> +       if ((info->device_count + 1) > info->mount_opts.max)
> +               return -ENOSPC;

Can you elaborate what the use-case for this limit is?

With loopfs in place, any process can create its own user_ns, mount
their private loopfs and create as many loop-devices as they want.
Hence, this limit does not serve as an effective global
resource-control. Secondly, anyone with access to `loop-control` can
now create loop instances until this limit is hit, thus causing anyone
else to be unable to create more. This effectively prevents you from
sharing a loopfs between non-trusting parties. I am unsure where that
limit would actually be used?

Thanks
David

> +
> +       if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name))
> +               return -EINVAL;
> +
> +       inode = new_inode(sb);
> +       if (!inode)
> +               return -ENOMEM;
> +
> +       /*
> +        * The i_fop field will be set to the correct fops by the device layer
> +        * when the loop device in this loopfs instance is opened.
> +        */
> +       inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
> +       inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> +       inode->i_uid = info->root_uid;
> +       inode->i_gid = info->root_gid;
> +       init_special_inode(inode, S_IFBLK | 0600, device_nr);
> +
> +       root = sb->s_root;
> +       inode_lock(d_inode(root));
> +       /* look it up */
> +       dentry = lookup_one_len(name, root, strlen(name));
> +       if (IS_ERR(dentry)) {
> +               inode_unlock(d_inode(root));
> +               iput(inode);
> +               return PTR_ERR(dentry);
> +       }
> +
> +       if (d_really_is_positive(dentry)) {
> +               /* already exists */
> +               dput(dentry);
> +               inode_unlock(d_inode(root));
> +               iput(inode);
> +               return -EEXIST;
> +       }
> +
> +       d_instantiate(dentry, inode);
> +       fsnotify_create(d_inode(root), dentry);
> +       inode_unlock(d_inode(root));
> +
> +       inode->i_private = lo;
> +       lo->lo_loopfs_i = inode;
> +       info->device_count++;
> +
> +       return 0;
> +}
[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/8] loopfs
  2020-04-08 16:41   ` Stéphane Graber
@ 2020-04-09  7:02     ` Dmitry Vyukov
  0 siblings, 0 replies; 29+ messages in thread
From: Dmitry Vyukov @ 2020-04-09  7:02 UTC (permalink / raw)
  To: Stéphane Graber
  Cc: Jann Horn, Christian Brauner, Jens Axboe, Greg Kroah-Hartman,
	kernel list, linux-block, Linux API, Jonathan Corbet,
	Serge Hallyn, Rafael J. Wysocki, Tejun Heo, David S. Miller,
	Saravana Kannan, Jan Kara, David Howells, Seth Forshee,
	David Rheinsberg, Tom Gundersen, Christian Kellner,
	open list:DOCUMENTATION, Network Development, Matthew Garrett,
	linux-fsdevel, syzkaller

On Wed, Apr 8, 2020 at 6:41 PM Stéphane Graber <stgraber@ubuntu.com> wrote:
>
> On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > > devices in sandboxed workloads without exposing /dev or
> > > /dev/loop-control to the workload in question and without having to
> > > implement a complex and also racy protocol to send around file
> > > descriptors for loop devices. With loopfs each mount is a new instance,
> > > i.e. loop devices created in one loopfs instance are independent of any
> > > loop devices created in another loopfs instance. This allows
> > > sufficiently privileged tools to have their own private stash of loop
> > > device instances. Dmitry has expressed his desire to use this for
> > > syzkaller in a private discussion. And various parties that want to use
> > > it are Cced here too.
> > >
> > > In addition, the loopfs filesystem can be mounted by user namespace root
> > > and is thus suitable for use in containers. Combined with syscall
> > > interception this makes it possible to securely delegate mounting of
> > > images on loop devices, i.e. when a user calls mount -o loop <image>
> > > <mountpoint> it will be possible to completely setup the loop device.
> > > The final mount syscall to actually perform the mount will be handled
> > > through syscall interception and be performed by a sufficiently
> > > privileged process. Syscall interception is already supported through a
> > > new seccomp feature we implemented in [1] and extended in [2] and is
> > > actively used in production workloads. The additional loopfs work will
> > > be used there and in various other workloads too. You'll find a short
> > > illustration how this works with syscall interception below in [4].
> >
> > Would that privileged process then allow you to mount your filesystem
> > images with things like ext4? As far as I know, the filesystem
> > maintainers don't generally consider "untrusted filesystem image" to
> > be a strongly enforced security boundary; and worse, if an attacker
> > has access to a loop device from which something like ext4 is mounted,
> > things like "struct ext4_dir_entry_2" will effectively be in shared
> > memory, and an attacker can trivially bypass e.g.
> > ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> > anything other than kernel lockdown) because only root normally has
> > access to loop devices.
> >
> > Ubuntu carries an out-of-tree patch that afaik blocks the shared
> > memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
> >
> > But even with that patch, I'm not super excited about exposing
> > filesystem image parsing attack surface to containers unless you run
> > the filesystem in a sandboxed environment (at which point you don't
> > need a loop device anymore either).
>
> So in general we certainly agree that you should never expose someone
> that you wouldn't trust with root on the host to syscall interception
> mounting of real kernel filesystems.
>
> But that's not all that our syscall interception logic can do. We have
> support for rewriting a normal filesystem mount attempt to instead use
> an available FUSE implementation. As far as the user is concerned,
> they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
> on /mnt as requested, except that the container manager intercepted
> the mount attempt and instead spawned fuse2fs for that mount. This
> requires absolutely no change to the software the user is running.
>
> loopfs, with that interception mode, will let us also handle all cases
> where a loop would be used, similarly without needing any change to
> the software being run. If a piece of software calls the command
> "mount -o loop blah.img /mnt", the "mount" command will setup a loop
> device as it normally would (doing so through loopfs) and then will
> call the "mount" syscall, which will get intercepted and redirected to
> a FUSE implementation if so configured, resulting in the expected
> filesystem being mounted for the user.
>
> LXD with syscall interception offers both straight up privileged
> mounting using the kernel fs or using a FUSE based implementation.
> This is configurable on a per-filesystem and per-container basis.
>
> I hope that clarifies what we're doing here :)
>
> Stéphane


Hi Christian,

Our use case for loopfs in syzkaller would be isolation of several
test processes from each other.
Currently all loop devices and loop-control are global and cause test
processes to collide, which in turn causes non-reproducible coverage
and non-reproducible crashes. Ideally we give each test process its
own loopfs instance.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-08 15:21 ` [PATCH 2/8] loopfs: implement loopfs Christian Brauner
  2020-04-09  5:39   ` David Rheinsberg
@ 2020-04-09  7:53   ` Christoph Hellwig
  2020-04-09  8:33     ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-04-09  7:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

Almost 600 lines of code for a little bit of fine grained control
is the wrong tradeoff.  Please find a cheaper way to do this.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-09  5:39   ` David Rheinsberg
@ 2020-04-09  8:26     ` Christian Brauner
  2020-04-12 10:38       ` David Rheinsberg
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-09  8:26 UTC (permalink / raw)
  To: David Rheinsberg
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

On Thu, Apr 09, 2020 at 07:39:18AM +0200, David Rheinsberg wrote:
> Hi
> 
> On Wed, Apr 8, 2020 at 5:27 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > This implements loopfs, a loop device filesystem. It takes inspiration
> > from the binderfs filesystem I implemented about two years ago and with
> > which we had overally good experiences so far. Parts of it are also
> > based on [3] but it's mostly a new, imho cleaner approach.
> >
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a users calls mount -o loop <image>
> > <mountpoint> it will be possible to completely setup the loop device
> > (enabled in later patches) and the mount syscall to actually perform the
> > mount will be handled through syscall interception and be performed by a
> > sufficiently privileged process. Syscall interception is already
> > supported through a new seccomp feature we implemented in [1] and
> > extended in [2] and is actively used in production workloads. The
> > additional loopfs work will be used there and in various other workloads
> > too.
> >
> > The number of loop devices available to a loopfs instance can be limited
> > by setting the "max" mount option to a positive integer. This e.g.
> > allows sufficiently privileged processes to dynamically enforce a limit
> > on the number of devices. This limit is dynamic in contrast to the
> > max_loop module option in that a sufficiently privileged process can
> > update it with a simple remount operation.
> >
> > The loopfs filesystem is placed under a new config option and special
> > care has been taken to not introduce any new code when users do not
> > select this config option.
> >
> > Note that in __loop_clr_fd() we now need not just check whether bdev is
> > valid but also whether bdev->bd_disk is valid. This wasn't necessary
> > before because in order to call LOOP_CLR_FD the loop device would need
> > to be open and thus bdev->bd_disk was guaranteed to be allocated. For
> > loopfs loop devices we allow callers to simply unlink them just as we do
> > for binderfs binder devices and we do also need to account for the case
> > where a loopfs superblock is shutdown while backing files might still be
> > associated with some loop devices. In such cases no bd_disk device will
> > be attached to bdev. This is not in itself noteworthy it's more about
> > documenting the "why" of the added bdev->bd_disk check for posterity.
> >
> > [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> > [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> > [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Seth Forshee <seth.forshee@canonical.com>
> > Cc: Tom Gundersen <teg@jklm.no>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Christian Kellner <ckellner@redhat.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: David Rheinsberg <david.rheinsberg@gmail.com>
> > Cc: Dmitry Vyukov <dvyukov@google.com>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  MAINTAINERS                   |   5 +
> >  drivers/block/Kconfig         |   4 +
> >  drivers/block/Makefile        |   1 +
> >  drivers/block/loop.c          | 151 +++++++++---
> >  drivers/block/loop.h          |   8 +-
> >  drivers/block/loopfs/Makefile |   3 +
> >  drivers/block/loopfs/loopfs.c | 429 ++++++++++++++++++++++++++++++++++
> >  drivers/block/loopfs/loopfs.h |  35 +++
> >  include/uapi/linux/magic.h    |   1 +
> >  9 files changed, 600 insertions(+), 37 deletions(-)
> >  create mode 100644 drivers/block/loopfs/Makefile
> >  create mode 100644 drivers/block/loopfs/loopfs.c
> >  create mode 100644 drivers/block/loopfs/loopfs.h
> >
> [...]
> > diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
> > new file mode 100644
> > index 000000000000..ac46aa337008
> > --- /dev/null
> > +++ b/drivers/block/loopfs/loopfs.c
> > @@ -0,0 +1,429 @@
> [...]
> > +/**
> > + * loopfs_loop_device_create - allocate inode from super block of a loopfs mount
> > + * @lo:                loop device for which we are creating a new device entry
> > + * @ref_inode: inode from wich the super block will be taken
> > + * @device_nr:  device number of the associated disk device
> > + *
> > + * This function creates a new device node for @lo.
> > + * Minor numbers are limited and tracked globally. The
> > + * function will stash a struct loop_device for the specific loop
> > + * device in i_private of the inode.
> > + * It will go on to allocate a new inode from the super block of the
> > + * filesystem mount, stash a struct loop_device in its i_private field
> > + * and attach a dentry to that inode.
> > + *
> > + * Return: 0 on success, negative errno on failure
> > + */
> > +int loopfs_loop_device_create(struct loop_device *lo, struct inode *ref_inode,
> > +                             dev_t device_nr)
> > +{
> > +       char name[DISK_NAME_LEN];
> > +       struct super_block *sb;
> > +       struct loopfs_info *info;
> > +       struct dentry *root, *dentry;
> > +       struct inode *inode;
> > +
> > +       sb = loopfs_i_sb(ref_inode);
> > +       if (!sb)
> > +               return 0;
> > +
> > +       if (MAJOR(device_nr) != LOOP_MAJOR)
> > +               return -EINVAL;
> > +
> > +       info = LOOPFS_SB(sb);
> > +       if ((info->device_count + 1) > info->mount_opts.max)
> > +               return -ENOSPC;
> 
> Can you elaborate what the use-case for this limit is?

Sure.

> 
> With loopfs in place, any process can create its own user_ns, mount
> their private loopfs and create as many loop-devices as they want.
> Hence, this limit does not serve as an effective global
> resource-control. Secondly, anyone with access to `loop-control` can
> now create loop instances until this limit is hit, thus causing anyone
> else to be unable to create more. This effectively prevents you from
> sharing a loopfs between non-trusting parties. I am unsure where that
> limit would actually be used?

Restricting it globally indeed wasn't the intended use-case for it. This
was more so that you can specify an instance limit, bind-mount that
instance into several places and sufficiently locked down users cannot
exceed the instance limit. 
I don't think we'd be getting much out of a global limit per se I think
the initial namespace being able to reserve a bunch of devices
they can always rely on being able create when they need them is more
interesting. This is similat to what devpts implements with the
"reserved" mount option and what I initially proposed for binderfs. For
the latter it was deemed unnecessary by others so I dropped it from
loopfs too.
I also expect most users to pre-create devices in the initial namespace
instance they need (e.g. similar to what binderfs does or what loop
devices currently have). Does that make sense to you?

Thanks!
Christian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-09  7:53   ` Christoph Hellwig
@ 2020-04-09  8:33     ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-09  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev

On Thu, Apr 09, 2020 at 12:53:20AM -0700, Christoph Hellwig wrote:
> Almost 600 lines of code for a little bit of fine grained control
> is the wrong tradeoff.  Please find a cheaper way to do this.

I think that's a slight misrepresentation of the patchset. Of course, I
get reservations against adding new code but none of this code will
exist at all if the config option is not set; and the config option is
not selected by default. I don't want people to have to use something
they don't care about of course.
The patchset itself unblocks a range of use-cases we had issues with for
quite a while and the standalone, tiny filesystem approach has served us
well already, so this is not something new. It's not just gaining
fine-grained control, it's a whole set of new uses and we don't just do
it for the fun of doing it but because we do have actual users of this.

Christian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-09  8:26     ` Christian Brauner
@ 2020-04-12 10:38       ` David Rheinsberg
  2020-04-12 12:03         ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: David Rheinsberg @ 2020-04-12 10:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

Hey

On Thu, Apr 9, 2020 at 10:27 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Thu, Apr 09, 2020 at 07:39:18AM +0200, David Rheinsberg wrote:
> > With loopfs in place, any process can create its own user_ns, mount
> > their private loopfs and create as many loop-devices as they want.
> > Hence, this limit does not serve as an effective global
> > resource-control. Secondly, anyone with access to `loop-control` can
> > now create loop instances until this limit is hit, thus causing anyone
> > else to be unable to create more. This effectively prevents you from
> > sharing a loopfs between non-trusting parties. I am unsure where that
> > limit would actually be used?
>
> Restricting it globally indeed wasn't the intended use-case for it. This
> was more so that you can specify an instance limit, bind-mount that
> instance into several places and sufficiently locked down users cannot
> exceed the instance limit.

But then these users can each exhaust the limit individually. As such,
you cannot share this instance across users that have no
trust-relationship. Fine with me, but I still don't understand in
which scenario the limit would be useful. Anyone can create a user-ns,
create a new loopfs mount, and just happily create more loop-devices.
So what is so special that you want to restrict the devices on a
_single_ mount instance?

> I don't think we'd be getting much out of a global limit per se I think
> the initial namespace being able to reserve a bunch of devices
> they can always rely on being able create when they need them is more
> interesting. This is similat to what devpts implements with the
> "reserved" mount option and what I initially proposed for binderfs. For
> the latter it was deemed unnecessary by others so I dropped it from
> loopfs too.

The `reserve` of devpts has a fixed 2-tier system: A global limit, and
a init-ns reserve. This does nothing to protect one container from
another.

Furthermore, how do you intend to limit user-space from creating an
unbound amount of loop devices? Unless I am mistaken, with your
proposal *any* process can create a new loopfs with a basically
unlimited amount of loop-devices, thus easily triggering unbound
kernel allocations. I think this needs to be accounted. The classic
way is to put a per-uid limit into `struct user_struct` (done by
pipes, mlock, epoll, mq, etc.). An alternative is `struct ucount`,
which allows hierarchical management (inotify uses that, as an
example).

> I also expect most users to pre-create devices in the initial namespace
> instance they need (e.g. similar to what binderfs does or what loop
> devices currently have). Does that make sense to you?

Our use-case is to get programmatic access to loop-devices, so we can
build customer images on request (especially to create XFS images,
since mkfs.xfs cannot write them, IIRC). We would be perfectly happy
with a kernel-interface that takes a file-descriptor to a regular file
and returns us a file-descriptor to a newly created block device
(which is automatically destroyed when the last file-descriptor to it
is closed). This would be ideal *to us*, since it would do automatic
cleanup on crashes.

We don't need any representation of the loop-device in the
file-system, as long as we can somehow mount it (either by passing the
bdev-FD to the new mount-api, or by using /proc/self/fd/ as
mount-source).

With your proposed loop-fs we could achieve something close to it:
Mount a private loopfs, create a loop-device, and rely on automatic
cleanup when the mount-namespace is destroyed.

Thanks
David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-12 10:38       ` David Rheinsberg
@ 2020-04-12 12:03         ` Christian Brauner
  2020-04-12 13:04           ` Christian Brauner
  2020-04-12 13:44           ` David Rheinsberg
  0 siblings, 2 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-12 12:03 UTC (permalink / raw)
  To: David Rheinsberg
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

On Sun, Apr 12, 2020 at 12:38:54PM +0200, David Rheinsberg wrote:
> Hey
> 
> On Thu, Apr 9, 2020 at 10:27 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > On Thu, Apr 09, 2020 at 07:39:18AM +0200, David Rheinsberg wrote:
> > > With loopfs in place, any process can create its own user_ns, mount
> > > their private loopfs and create as many loop-devices as they want.
> > > Hence, this limit does not serve as an effective global
> > > resource-control. Secondly, anyone with access to `loop-control` can
> > > now create loop instances until this limit is hit, thus causing anyone
> > > else to be unable to create more. This effectively prevents you from
> > > sharing a loopfs between non-trusting parties. I am unsure where that
> > > limit would actually be used?
> >
> > Restricting it globally indeed wasn't the intended use-case for it. This
> > was more so that you can specify an instance limit, bind-mount that
> > instance into several places and sufficiently locked down users cannot
> > exceed the instance limit.
> 
> But then these users can each exhaust the limit individually. As such,
> you cannot share this instance across users that have no
> trust-relationship. Fine with me, but I still don't understand in

Well, you can't really share anything across clients with the same
privilege level if one of them is untrusted.

> which scenario the limit would be useful. Anyone can create a user-ns,
> create a new loopfs mount, and just happily create more loop-devices.
> So what is so special that you want to restrict the devices on a
> _single_ mount instance?

To share that instance across namespaces. You can e.g. create the
mount instance in one mount namespace owned by userns1, create a second
user namespace usern2 with the same mapping which is blocked from
creating additional user namespaces either by seccomp or by
/proc/sys/user/max_user_namespaces or lsms what have you. Because it
doesn't own the mount namespace the loopfs mount it is in it can't
remount it and can't exceed the local limit.

> 
> > I don't think we'd be getting much out of a global limit per se I think
> > the initial namespace being able to reserve a bunch of devices
> > they can always rely on being able create when they need them is more
> > interesting. This is similat to what devpts implements with the
> > "reserved" mount option and what I initially proposed for binderfs. For
> > the latter it was deemed unnecessary by others so I dropped it from
> > loopfs too.
> 
> The `reserve` of devpts has a fixed 2-tier system: A global limit, and
> a init-ns reserve. This does nothing to protect one container from
> another.

What I was getting at is that what matters first and foremost is
protecting init userns.

> 
> Furthermore, how do you intend to limit user-space from creating an
> unbound amount of loop devices? Unless I am mistaken, with your
> proposal *any* process can create a new loopfs with a basically
> unlimited amount of loop-devices, thus easily triggering unbound
> kernel allocations. I think this needs to be accounted. The classic
> way is to put a per-uid limit into `struct user_struct` (done by
> pipes, mlock, epoll, mq, etc.). An alternative is `struct ucount`,
> which allows hierarchical management (inotify uses that, as an
> example).

Yeah, I know. We can certainly do this.

> 
> > I also expect most users to pre-create devices in the initial namespace
> > instance they need (e.g. similar to what binderfs does or what loop
> > devices currently have). Does that make sense to you?
> 
> Our use-case is to get programmatic access to loop-devices, so we can
> build customer images on request (especially to create XFS images,
> since mkfs.xfs cannot write them, IIRC). We would be perfectly happy
> with a kernel-interface that takes a file-descriptor to a regular file
> and returns us a file-descriptor to a newly created block device
> (which is automatically destroyed when the last file-descriptor to it
> is closed). This would be ideal *to us*, since it would do automatic
> cleanup on crashes.
> 
> We don't need any representation of the loop-device in the
> file-system, as long as we can somehow mount it (either by passing the
> bdev-FD to the new mount-api, or by using /proc/self/fd/ as
> mount-source).

We want the ability to have a filesystem representation as it will allow
us to handle a host of legacy workloads cleanly e.g. that users can just
call mount -o loop /bla whenever they have opted into syscall
interception for a particular filesystem. In addition, we can cover your
use case completely was well I think. Both with the old and new mount api.

> 
> With your proposed loop-fs we could achieve something close to it:
> Mount a private loopfs, create a loop-device, and rely on automatic
> cleanup when the mount-namespace is destroyed.

With loopfs you can do this with the old or new mount api and you don't
need to have loopfs mounted for that at all. Here's a sample program
that works right now with the old mount api:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <linux/loop.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <limits.h>
#include <linux/bpf.h>
#include <linux/magic.h>
#include <linux/sched.h>
#include <malloc.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/fsuid.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/param.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int attach_image_to_loop(const char *source, int loop_fd)
{
	int ret, fret = -1;
	struct loop_info64 lo64;
	int fd_img = -1;

	fd_img = open(source, O_RDWR | O_CLOEXEC);
	if (fd_img < 0) {
		fprintf(stderr, "Failed to open %s\n", source);
		goto on_error;
	}

	ret = ioctl(loop_fd, LOOP_SET_FD, fd_img);
	if (ret < 0) {
		fprintf(stderr, "%m - Failed to set loop device to %s\n", source);
		goto on_error;
	}

	memset(&lo64, 0, sizeof(lo64));

	snprintf((char *)lo64.lo_file_name, LO_NAME_SIZE, "%s", source);

	ret = ioctl(loop_fd, LOOP_SET_STATUS64, &lo64);
	if (ret < 0) {
		fprintf(stderr, "Failed to set loop device status for %s\n", source);
		goto on_error;
	}

	fret = 0;

on_error:
	if (fd_img >= 0)
		close(fd_img);

	return fret;
}

int main(int argc, char *argv[])
{
	int n = 1;
	int ret, mntfd, loop_ctl_fd, loopidx, loopfd;
	char path[4096];

	/* Mount loopfs. */
	ret = mount("none", "/mnt", "loop", 0, 0);
	if (ret)
		exit(n++);

	/* Stash file descriptor to mount. */
	mntfd = open("/mnt", O_DIRECTORY);
	if (mntfd < 0)
		exit(n++);

	/* Stash file descriptor to loop-control. */
	loop_ctl_fd = open("/mnt/loop-control", O_RDWR | O_CLOEXEC);
	if (loop_ctl_fd < 0)
		exit(n++);

	/*
	 * Detach mount so none can access it anymore and also we don't need it
	 * anymore.
	 */
	ret = umount2("/mnt", MNT_DETACH);
	if (ret)
		exit(n++);

	/* Get new loop device index. */
	loopidx = ioctl(loop_ctl_fd, LOOP_CTL_GET_FREE);
	if (loopidx < 0)
		exit(n++);

	/* Use openat() to open loop device in private instance. */
	snprintf(path, sizeof(path), "loop%d", loopidx);
	loopfd = openat(mntfd, path, O_RDWR | O_CLOEXEC);
	if (loopfd < 0)
		exit(n++);

	/* Attach image to loop device. */
	ret = attach_image_to_loop("/bla.img", loopfd);
	if (ret)
		exit(n++);

	/* Mount through /proc/self/fd/<nr> */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", loopfd);
	ret = mount(path, "/opt", "btrfs", 0, 0);
	if (ret)
		exit(6);

	/* Repeat as often as you want or close loopfs instance. */

	exit(EXIT_SUCCESS);
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-12 12:03         ` Christian Brauner
@ 2020-04-12 13:04           ` Christian Brauner
  2020-04-12 13:44           ` David Rheinsberg
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-12 13:04 UTC (permalink / raw)
  To: David Rheinsberg
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

On Sun, Apr 12, 2020 at 02:03:00PM +0200, Christian Brauner wrote:
> On Sun, Apr 12, 2020 at 12:38:54PM +0200, David Rheinsberg wrote:
> > Hey
> > 
> > On Thu, Apr 9, 2020 at 10:27 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > On Thu, Apr 09, 2020 at 07:39:18AM +0200, David Rheinsberg wrote:
> > > > With loopfs in place, any process can create its own user_ns, mount
> > > > their private loopfs and create as many loop-devices as they want.
> > > > Hence, this limit does not serve as an effective global
> > > > resource-control. Secondly, anyone with access to `loop-control` can
> > > > now create loop instances until this limit is hit, thus causing anyone
> > > > else to be unable to create more. This effectively prevents you from
> > > > sharing a loopfs between non-trusting parties. I am unsure where that
> > > > limit would actually be used?
> > >
> > > Restricting it globally indeed wasn't the intended use-case for it. This
> > > was more so that you can specify an instance limit, bind-mount that
> > > instance into several places and sufficiently locked down users cannot
> > > exceed the instance limit.
> > 
> > But then these users can each exhaust the limit individually. As such,
> > you cannot share this instance across users that have no
> > trust-relationship. Fine with me, but I still don't understand in
> 
> Well, you can't really share anything across clients with the same
> privilege level if one of them is untrusted.
> 
> > which scenario the limit would be useful. Anyone can create a user-ns,
> > create a new loopfs mount, and just happily create more loop-devices.
> > So what is so special that you want to restrict the devices on a
> > _single_ mount instance?
> 
> To share that instance across namespaces. You can e.g. create the
> mount instance in one mount namespace owned by userns1, create a second
> user namespace usern2 with the same mapping which is blocked from
> creating additional user namespaces either by seccomp or by
> /proc/sys/user/max_user_namespaces or lsms what have you. Because it
> doesn't own the mount namespace the loopfs mount it is in it can't
> remount it and can't exceed the local limit.
> 
> > 
> > > I don't think we'd be getting much out of a global limit per se I think
> > > the initial namespace being able to reserve a bunch of devices
> > > they can always rely on being able create when they need them is more
> > > interesting. This is similat to what devpts implements with the
> > > "reserved" mount option and what I initially proposed for binderfs. For
> > > the latter it was deemed unnecessary by others so I dropped it from
> > > loopfs too.
> > 
> > The `reserve` of devpts has a fixed 2-tier system: A global limit, and
> > a init-ns reserve. This does nothing to protect one container from
> > another.
> 
> What I was getting at is that what matters first and foremost is
> protecting init userns.
> 
> > 
> > Furthermore, how do you intend to limit user-space from creating an
> > unbound amount of loop devices? Unless I am mistaken, with your
> > proposal *any* process can create a new loopfs with a basically
> > unlimited amount of loop-devices, thus easily triggering unbound
> > kernel allocations. I think this needs to be accounted. The classic
> > way is to put a per-uid limit into `struct user_struct` (done by
> > pipes, mlock, epoll, mq, etc.). An alternative is `struct ucount`,
> > which allows hierarchical management (inotify uses that, as an
> > example).
> 
> Yeah, I know. We can certainly do this.
> 
> > 
> > > I also expect most users to pre-create devices in the initial namespace
> > > instance they need (e.g. similar to what binderfs does or what loop
> > > devices currently have). Does that make sense to you?
> > 
> > Our use-case is to get programmatic access to loop-devices, so we can
> > build customer images on request (especially to create XFS images,
> > since mkfs.xfs cannot write them, IIRC). We would be perfectly happy
> > with a kernel-interface that takes a file-descriptor to a regular file
> > and returns us a file-descriptor to a newly created block device
> > (which is automatically destroyed when the last file-descriptor to it
> > is closed). This would be ideal *to us*, since it would do automatic
> > cleanup on crashes.
> > 
> > We don't need any representation of the loop-device in the
> > file-system, as long as we can somehow mount it (either by passing the
> > bdev-FD to the new mount-api, or by using /proc/self/fd/ as
> > mount-source).
> 
> We want the ability to have a filesystem representation as it will allow
> us to handle a host of legacy workloads cleanly e.g. that users can just
> call mount -o loop /bla whenever they have opted into syscall
> interception for a particular filesystem. In addition, we can cover your
> use case completely was well I think. Both with the old and new mount api.
> 
> > 
> > With your proposed loop-fs we could achieve something close to it:
> > Mount a private loopfs, create a loop-device, and rely on automatic
> > cleanup when the mount-namespace is destroyed.
> 
> With loopfs you can do this with the old or new mount api and you don't
> need to have loopfs mounted for that at all. Here's a sample program
> that works right now with the old mount api:

That also led me to discover a bug I need to fix, so thanks!
Christian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/8] loopfs: implement loopfs
  2020-04-12 12:03         ` Christian Brauner
  2020-04-12 13:04           ` Christian Brauner
@ 2020-04-12 13:44           ` David Rheinsberg
  1 sibling, 0 replies; 29+ messages in thread
From: David Rheinsberg @ 2020-04-12 13:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, lkml, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, Tom Gundersen, Christian Kellner, Dmitry Vyukov,
	Stéphane Graber, linux-doc, netdev

Hi

On Sun, Apr 12, 2020 at 2:03 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
[...]
> On Sun, Apr 12, 2020 at 12:38:54PM +0200, David Rheinsberg wrote:
> > which scenario the limit would be useful. Anyone can create a user-ns,
> > create a new loopfs mount, and just happily create more loop-devices.
> > So what is so special that you want to restrict the devices on a
> > _single_ mount instance?
>
> To share that instance across namespaces. You can e.g. create the
> mount instance in one mount namespace owned by userns1, create a second
> user namespace usern2 with the same mapping which is blocked from
> creating additional user namespaces either by seccomp or by
> /proc/sys/user/max_user_namespaces or lsms what have you. Because it
> doesn't own the mount namespace the loopfs mount it is in it can't
> remount it and can't exceed the local limit.

Right. But now you re-use the userns-limit to also limit loopfs (or
other userns restrictions to limit loopfs access). Existing safe
setups allow contained processes to create their own user-namespace.
With your patchset merged, every such existing contained system with
userns-access gets access to a kernel API that allows them unbound
kernel memory allocations. I don't think you can tell every existing
system to not enable CONFIG_LOOP_FS. Or to make sure to install
seccomp filters before updating their kernels. Right? These setups
already exist, and they happily use distribution kernels.

I think there is no way around `struct user_struct`, `struct ucount`,
or whatever you like.

> > Furthermore, how do you intend to limit user-space from creating an
> > unbound amount of loop devices? Unless I am mistaken, with your
> > proposal *any* process can create a new loopfs with a basically
> > unlimited amount of loop-devices, thus easily triggering unbound
> > kernel allocations. I think this needs to be accounted. The classic
> > way is to put a per-uid limit into `struct user_struct` (done by
> > pipes, mlock, epoll, mq, etc.). An alternative is `struct ucount`,
> > which allows hierarchical management (inotify uses that, as an
> > example).
>
> Yeah, I know. We can certainly do this.

My point is, I think we have to.

[...]
> > With your proposed loop-fs we could achieve something close to it:
> > Mount a private loopfs, create a loop-device, and rely on automatic
> > cleanup when the mount-namespace is destroyed.
>
> With loopfs you can do this with the old or new mount api and you don't
> need to have loopfs mounted for that at all. Here's a sample program
> that works right now with the old mount api:

Yeah, loopfs would certainly allow this, and I would be perfectly
happy with this API. I think it is overly heavy for the use-case we
have, but I do acknowledge that there are other use-cases as well.
But I think your claim that "you don't need to have loopfs mounted" is
misleading. loopfs must be mounted for the entirety of the program.
Instead, you don't have to have it linked in your mount-namespace,
since you can immediately detach it. And with the new mount-APIs, you
don't even need it linked initially, as you can create a detached
mount right away.

Thanks
David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/8] kernfs: handle multiple namespace tags
  2020-04-08 15:21 ` [PATCH 4/8] kernfs: handle multiple namespace tags Christian Brauner
@ 2020-04-13 18:46   ` Tejun Heo
  0 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2020-04-13 18:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

On Wed, Apr 08, 2020 at 05:21:47PM +0200, Christian Brauner wrote:
> Since [1] kernfs supports namespace tags. This feature is essential to
> enable sysfs to present different views of on various parts depending on
> the namespace tag. For example, the /sys/class/net/ directory will only
> show network devices that belong to the network namespace that sysfs was
> mounted in. This is achieved by stashing a reference to the network
> namespace of the task mounting sysfs in the super block. And when a
> lookup operation is performed on e.g. /sys/class/net/ kernfs will
> compare the network namespace tag of the kernfs_node associated with the
> device and kobject of the network device to the network namespace of the
> network device. This ensures that only network devices owned by the
> network namespace sysfs was mounted in are shown, a feature which is
> essential to containers.
> For loopfs to show correct permissions in sysfs just as with network
> devices we need to be able to tag kernfs_super_info with additional
> namespaces. This extension was even already mentioned in a comment to
> struct kernfs_super_info:
>   /*
>    * Each sb is associated with one namespace tag, currently the
>    * network namespace of the task which mounted this kernfs
>    * instance.  If multiple tags become necessary, make the following
>    * an array and compare kernfs_node tag against every entry.
>    */
> This patch extends the kernfs_super_info and kernfs_fs_context ns
> pointers to fixed-size arrays of namespace tags. The size is taken from
> the namespaces currently supported by kobjects, i.e. we don't extend it
> to cover all namespace but only the ones kernfs needs to support.
> In addition, the kernfs_node struct gains an additional member that
> indicates the type of namespace this kernfs_node was tagged with. This
> allows us to simply retrieve the correct namespace tag from the
> kernfs_fs_context and kernfs_super_info ns array with a simple indexing
> operation. This has the advantage that we can just keep passing down the
> correct namespace instead of passing down the array.
> 
> [1]: 608b4b9548de ("netns: Teach network device kobjects which namespace they are in.")
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-08 15:21 ` [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace Christian Brauner
@ 2020-04-13 19:02   ` Tejun Heo
  2020-04-13 19:39     ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2020-04-13 19:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

Hello,

On Wed, Apr 08, 2020 at 05:21:48PM +0200, Christian Brauner wrote:
> The initial namespace is special in many ways. One feature it always has
> had is that it propagates all its devices into all non-initial
> namespaces. This is e.g. true for all device classes under /sys/class/

Maybe I'm missing your point but I've always thought of it the other way
around. Some namespaces make all objects visible in init_ns so that all
non-init namespaces are subset of the init one, which sometimes requires
creating aliases. Other namespaces don't do that. At least in my experience,
the former is a lot easier to administer.

The current namespace support in kernfs behaves the way it does because the
only namespace it supports is netns, but if we're expanding it, I think it
might be better to default to init_ns is superset of all others model and make
netns opt for the disjointing behavior.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/8] genhd: add minimal namespace infrastructure
  2020-04-08 15:21 ` [PATCH 6/8] genhd: add minimal namespace infrastructure Christian Brauner
@ 2020-04-13 19:04   ` Tejun Heo
  2020-04-13 19:42     ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2020-04-13 19:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

Hello,

On Wed, Apr 08, 2020 at 05:21:49PM +0200, Christian Brauner wrote:
> This lets the block_class properly support loopfs device by introducing
> the minimal infrastructure needed to support different sysfs views for
> devices belonging to the block_class. This is similar to how network
> devices work. Note, that nothing changes with this patch since

I was hoping that all devices on the system would be visible at the root level
as administration at system level becomes pretty tricky otherwise. Is it just
me who thinks this way?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-13 19:02   ` Tejun Heo
@ 2020-04-13 19:39     ` Christian Brauner
  2020-04-13 19:45       ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-13 19:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

On Mon, Apr 13, 2020 at 03:02:39PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 08, 2020 at 05:21:48PM +0200, Christian Brauner wrote:
> > The initial namespace is special in many ways. One feature it always has
> > had is that it propagates all its devices into all non-initial
> > namespaces. This is e.g. true for all device classes under /sys/class/
> 
> Maybe I'm missing your point but I've always thought of it the other way
> around. Some namespaces make all objects visible in init_ns so that all
> non-init namespaces are subset of the init one, which sometimes requires
> creating aliases. Other namespaces don't do that. At least in my experience,
> the former is a lot easier to administer.
> 
> The current namespace support in kernfs behaves the way it does because the
> only namespace it supports is netns, but if we're expanding it, I think it
> might be better to default to init_ns is superset of all others model and make
> netns opt for the disjointing behavior.

Hey Tejun,

The point was that devices have always been shown in all namespaces. You
can see all devices everywhere. Sure that wasn't ideal but we can't
really change that behavior since it would break userspace significantly
as a lot of tools are used to that behavior.

Another problem is that you might have two devices of the same class
with the same name that belong to different namespaces and if you shown
them all in the initial namespace you get clashes. This was one of the
original reasons why network devices are only shown in the namespace
they belong to but not in any other.

The network model of only showing the device in the namespace they belong
to also has the advantage that tools do not stomp on each others feet
when using them.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/8] genhd: add minimal namespace infrastructure
  2020-04-13 19:04   ` Tejun Heo
@ 2020-04-13 19:42     ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-13 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

On Mon, Apr 13, 2020 at 03:04:52PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 08, 2020 at 05:21:49PM +0200, Christian Brauner wrote:
> > This lets the block_class properly support loopfs device by introducing
> > the minimal infrastructure needed to support different sysfs views for
> > devices belonging to the block_class. This is similar to how network
> > devices work. Note, that nothing changes with this patch since
> 
> I was hoping that all devices on the system would be visible at the root level
> as administration at system level becomes pretty tricky otherwise. Is it just
> me who thinks this way?

Hey Tejun,

I think this is the same question in a different form you had in
https://lore.kernel.org/lkml/20200413193950.tokh5m7wsyrous3c@wittgenstein/T/#m20b396a29c8d499d9dc073e6aef38f38c08f8bbe
and I tried answered it there.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-13 19:39     ` Christian Brauner
@ 2020-04-13 19:45       ` Tejun Heo
  2020-04-13 19:59         ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2020-04-13 19:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

Hello,

On Mon, Apr 13, 2020 at 09:39:50PM +0200, Christian Brauner wrote:
> Another problem is that you might have two devices of the same class
> with the same name that belong to different namespaces and if you shown
> them all in the initial namespace you get clashes. This was one of the
> original reasons why network devices are only shown in the namespace
> they belong to but not in any other.

For example, pid namespace has the same issue but it doesn't solve the problem
by breaking up visibility at the root level - it makes everything visiable at
root but give per-ns aliases which are selectively visble depending on the
namespace. From administration POV, this is way easier and less error-prone to
deal with and I was hoping that we could head that way rather than netdev way
for new things.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-13 19:45       ` Tejun Heo
@ 2020-04-13 19:59         ` Christian Brauner
  2020-04-13 20:37           ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-04-13 19:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

On Mon, Apr 13, 2020 at 03:45:50PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 13, 2020 at 09:39:50PM +0200, Christian Brauner wrote:
> > Another problem is that you might have two devices of the same class
> > with the same name that belong to different namespaces and if you shown
> > them all in the initial namespace you get clashes. This was one of the
> > original reasons why network devices are only shown in the namespace
> > they belong to but not in any other.
> 
> For example, pid namespace has the same issue but it doesn't solve the problem
> by breaking up visibility at the root level - it makes everything visiable at
> root but give per-ns aliases which are selectively visble depending on the
> namespace. From administration POV, this is way easier and less error-prone to
> deal with and I was hoping that we could head that way rather than netdev way
> for new things.

Right, pid namespaces deal with a single random identifier about which
userspace makes no assumptions other than that it's a positive number so
generating aliases is fine. In addition pid namespaces are nicely
hierarchical. I fear that we might introduce unneeded complexity if we
go this way and start generating aliases for devices that userspace
already knows about and has expectations of. We also still face some of
the other problems I mentioned.
I do think that what you say might make sense to explore in more detail
for a new device class (or type under a given class) that userspace does
not yet know about and were we don't regress anything.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-13 19:59         ` Christian Brauner
@ 2020-04-13 20:37           ` Tejun Heo
  2020-04-14 10:39             ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2020-04-13 20:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

Hello,

On Mon, Apr 13, 2020 at 09:59:15PM +0200, Christian Brauner wrote:
> Right, pid namespaces deal with a single random identifier about which
> userspace makes no assumptions other than that it's a positive number so
> generating aliases is fine. In addition pid namespaces are nicely

I don't see any fundamental differences between pids and device numbers. One
of the reasons pid namespace does aliasing instead of just showing subsets is
because applications can have expectations on what the specific numbers should
be - e.g. for checkpoint-restarting.

> hierarchical. I fear that we might introduce unneeded complexity if we
> go this way and start generating aliases for devices that userspace

It adds complexity for sure but the other side of the scale is losing
visiblity into what devices are on the system, which can become really nasty
in practice, so I do think it can probably justify some additional complexity
especially if it's something which can be used by different devices. Even just
for block, if we end up expanding ns support to regular block devices for some
reason, it's kinda dreadful to think a situation where all devices can't be
discovered at the system level.

> already knows about and has expectations of. We also still face some of
> the other problems I mentioned.
> I do think that what you say might make sense to explore in more detail
> for a new device class (or type under a given class) that userspace does
> not yet know about and were we don't regress anything.

I don't quite follow why adding namespace support would break existing users.
Wouldn't namespace usage be opt-in?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace
  2020-04-13 20:37           ` Tejun Heo
@ 2020-04-14 10:39             ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-04-14 10:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev

On Mon, Apr 13, 2020 at 04:37:16PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 13, 2020 at 09:59:15PM +0200, Christian Brauner wrote:
> > Right, pid namespaces deal with a single random identifier about which
> > userspace makes no assumptions other than that it's a positive number so
> > generating aliases is fine. In addition pid namespaces are nicely
> 
> I don't see any fundamental differences between pids and device numbers. One
> of the reasons pid namespace does aliasing instead of just showing subsets is
> because applications can have expectations on what the specific numbers should
> be - e.g. for checkpoint-restarting.

One difference is that ownership is hierarchial in a pid namespace. This
becomes clear when looking at the parent child relationship when
creating new processes in nested pid namespaces. All processes created
in the innermost pid namespace are owned by that pid namespaces's init
process. If that pid namespace's init/subreaper process dies all
processes get zapped and autoreaped. In essence, unless the ancestor pid
namespace has setns()ed a process in there, ownership of that process is
clearly defined. I don't think that model is transferable to a device.
What seems most important to me here is that a pid namespace completely
defines ownership of a process. But there's not necessarily
a single namespace that guarantees ownership for all device types.
Network devices, imho are a good example for that. Their full ownership
is network namespace + user namespace actually. You could easily
envision other device classes where a combination of namespaces would
make sense.

> 
> > hierarchical. I fear that we might introduce unneeded complexity if we
> > go this way and start generating aliases for devices that userspace
> 
> It adds complexity for sure but the other side of the scale is losing
> visiblity into what devices are on the system, which can become really nasty
> in practice, so I do think it can probably justify some additional complexity
> especially if it's something which can be used by different devices. Even just
> for block, if we end up expanding ns support to regular block devices for some
> reason, it's kinda dreadful to think a situation where all devices can't be
> discovered at the system level.

Hm, it is already the case that we can't see all devices at the system
level. That includes network devices and also binderfs devices (the
latter don't have sysfs entries though which is what this is about).
And for virtual devices just as loop, binder, and network devices this
is fine imho. They are not physicall attached to your computer. Actual
disk devices where this would matter wouldn't be namespaced anyway imho.

We also need to consider that it is potentially dangerous for a
namespace to trigger a device event tricking the host into performing an
action on it. If e.g. the creation of a network device were to propagate
into all namespaces and there'd be a rule matching it you could trick
the host into performing privileged actions it. So it also isn't
obviously safe propagating devices out of their namespace. (I fixed
something similar to this just recently in a sysfs series.)

I addition the file ownership permissions would propagate from the inner
to all outer sysfs instances as well which would mean you'd suddenly
have 100000:100000 entries in your host's sysfs in the initial
namespace.

> 
> > already knows about and has expectations of. We also still face some of
> > the other problems I mentioned.
> > I do think that what you say might make sense to explore in more detail
> > for a new device class (or type under a given class) that userspace does
> > not yet know about and were we don't regress anything.
> 
> I don't quite follow why adding namespace support would break existing users.
> Wouldn't namespace usage be opt-in?

For sysfs, this change is opt-in per device type and it only applies to
loop devices here, i.e. if you don't e.g. use loopfs nothing changes
for you at all. If you use it, all that you get is correct ownership for
sysfs entries for those loop devices accounted to you in addition to all
the other entries that have always been there. This way we can handle
legacy workloads cleanly which we really want for our use-case.

Your model would effectively require a new version of sysfs where you
e.g. mount it with a new option that zaps all device entries that don't
belong to non-initial user namespaces. Which would mean most major tools
in containers will break completely. We can still totally try to bring
up a change like this independent of this patchset. This patchset
doesn't rule this out at all.

Christian

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, back to index

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-08 15:21 [PATCH 0/8] loopfs Christian Brauner
2020-04-08 15:21 ` [PATCH 1/8] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
2020-04-08 15:21 ` [PATCH 2/8] loopfs: implement loopfs Christian Brauner
2020-04-09  5:39   ` David Rheinsberg
2020-04-09  8:26     ` Christian Brauner
2020-04-12 10:38       ` David Rheinsberg
2020-04-12 12:03         ` Christian Brauner
2020-04-12 13:04           ` Christian Brauner
2020-04-12 13:44           ` David Rheinsberg
2020-04-09  7:53   ` Christoph Hellwig
2020-04-09  8:33     ` Christian Brauner
2020-04-08 15:21 ` [PATCH 3/8] loop: use ns_capable for some loop operations Christian Brauner
2020-04-08 15:21 ` [PATCH 4/8] kernfs: handle multiple namespace tags Christian Brauner
2020-04-13 18:46   ` Tejun Heo
2020-04-08 15:21 ` [PATCH 5/8] kernfs: let objects opt-in to propagating from the initial namespace Christian Brauner
2020-04-13 19:02   ` Tejun Heo
2020-04-13 19:39     ` Christian Brauner
2020-04-13 19:45       ` Tejun Heo
2020-04-13 19:59         ` Christian Brauner
2020-04-13 20:37           ` Tejun Heo
2020-04-14 10:39             ` Christian Brauner
2020-04-08 15:21 ` [PATCH 6/8] genhd: add minimal namespace infrastructure Christian Brauner
2020-04-13 19:04   ` Tejun Heo
2020-04-13 19:42     ` Christian Brauner
2020-04-08 15:21 ` [PATCH 7/8] loopfs: start attaching correct namespace during loop_add() Christian Brauner
2020-04-08 15:21 ` [PATCH 8/8] loopfs: only show devices in their correct instance Christian Brauner
2020-04-08 16:24 ` [PATCH 0/8] loopfs Jann Horn
2020-04-08 16:41   ` Stéphane Graber
2020-04-09  7:02     ` Dmitry Vyukov

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git
	git clone --mirror https://lore.kernel.org/lkml/10 lkml/git/10.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git