netdev.vger.kernel.org archive mirror
* [PATCH v2 0/7] loopfs
@ 2020-04-22 14:54 Christian Brauner
  2020-04-22 14:54 ` [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Hey everyone,

This is v2 of loopfs.

I've added a few more people to the Cc that want to make use of this,
added the missing ucount part that David pointed out, and expanded a
little more on how this is used.

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs makes it possible to create private loop device instances for
applications for various use-cases. It covers the use-case that was
expressed on-list and in-person to get programmatic access to private
loop devices for image building in sandboxes. An illustration for this
is provided in [4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that existed
before containers were as ubiquitous as they are today. With loopfs users
can call mount -o loop and in a correctly set up container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somewhere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre-allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely set up the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.
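
The symlink step of the container setup described above could be
sketched in userspace roughly as follows. This is only an illustration,
not code from this series: the helper names are hypothetical, and it
assumes that devices inside a loopfs instance use the usual loop<N>
names next to loop-control. Mounting the loopfs instance itself is left
out.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create dev_dir/<name> as a symlink to loopfs_dir/<name>.
 * Returns 0 on success (or if the link already exists), -1 otherwise.
 * Hypothetical helper, not part of the patch series.
 */
static int link_node(const char *loopfs_dir, const char *dev_dir,
		     const char *name)
{
	char target[4096], linkpath[4096];
	int n;

	n = snprintf(target, sizeof(target), "%s/%s", loopfs_dir, name);
	if (n < 0 || (size_t)n >= sizeof(target))
		return -1;

	n = snprintf(linkpath, sizeof(linkpath), "%s/%s", dev_dir, name);
	if (n < 0 || (size_t)n >= sizeof(linkpath))
		return -1;

	if (symlink(target, linkpath) < 0 && errno != EEXIST)
		return -1;

	return 0;
}

/*
 * Expose loop-control and loop0..loop<nr_devices - 1> from a private
 * loopfs instance under their standard names below dev_dir.
 * The loop<N> naming inside loopfs is an assumption.
 */
int expose_loop_nodes(const char *loopfs_dir, const char *dev_dir,
		      unsigned int nr_devices)
{
	char name[32];
	unsigned int i;

	if (link_node(loopfs_dir, dev_dir, "loop-control") < 0)
		return -1;

	for (i = 0; i < nr_devices; i++) {
		snprintf(name, sizeof(name), "loop%u", i);
		if (link_node(loopfs_dir, dev_dir, name) < 0)
			return -1;
	}

	return 0;
}
```

A container manager would call something like
expose_loop_nodes("/dev/loopfs", "/dev", 8) after mounting the private
instance and allocating the devices. Symlinks are used here because
they can be created before the targets exist; a bind-mount would work
as well.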

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which was also a frequent request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/fd/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances is
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.
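
To illustrate what "hierarchically" means here, the following is a
simplified userspace model, not the kernel's ucount implementation. The
struct and function names are made up for this example, and it tracks
one counter per namespace level, whereas the kernel tracks counts per
user and user namespace. The idea it shows is that allocating a device
charges every level up to the initial user namespace and fails if any
level is at its limit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of one level in the user namespace hierarchy. */
struct ns_count {
	struct ns_count *parent;	/* NULL for the initial namespace */
	unsigned long max;		/* models user.max_loop_devices */
	unsigned long count;		/* devices charged at this level */
};

/*
 * Charge one loop device to ns and all of its ancestors. If any level
 * is already at its limit, undo the partial charge and fail.
 */
bool charge_loop_device(struct ns_count *ns)
{
	struct ns_count *iter, *undo;

	for (iter = ns; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return true;

fail:
	/* Only the levels below the failing one were incremented. */
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release one previously charged loop device at every level. */
void uncharge_loop_device(struct ns_count *ns)
{
	for (; ns; ns = ns->parent)
		ns->count--;
}
```

With a limit of 3 at the top level, a nested namespace with a larger
limit of its own still cannot push the total beyond 3, which matches
the behaviour described above.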

In addition, loopfs has a "max" mount option which allows setting a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Thanks!
Christian

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: https://github.com/kubernetes-sigs/kind/issues/1333
     https://github.com/kubernetes-sigs/kind/issues/1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     https://github.com/moby/moby/issues/27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813

Christian Brauner (7):
  kobject_uevent: remove unneeded netlink_ns check
  loopfs: implement loopfs
  loop: use ns_capable for some loop operations
  kernfs: handle multiple namespace tags
  loop: preserve sysfs backwards compatibility
  loopfs: start attaching correct namespace during loop_add()
  loopfs: only show devices in their correct instance

 Documentation/filesystems/sysfs-tagging.txt |   1 -
 MAINTAINERS                                 |   5 +
 block/genhd.c                               |  79 ++++
 drivers/base/devtmpfs.c                     |   4 +-
 drivers/block/Kconfig                       |   4 +
 drivers/block/Makefile                      |   1 +
 drivers/block/loop.c                        | 226 +++++++--
 drivers/block/loop.h                        |  12 +-
 drivers/block/loopfs/Makefile               |   3 +
 drivers/block/loopfs/loopfs.c               | 494 ++++++++++++++++++++
 drivers/block/loopfs/loopfs.h               |  36 ++
 fs/kernfs/dir.c                             |  38 +-
 fs/kernfs/kernfs-internal.h                 |  33 +-
 fs/kernfs/mount.c                           |  11 +-
 fs/sysfs/mount.c                            |  14 +-
 include/linux/device.h                      |   3 +
 include/linux/genhd.h                       |   3 +
 include/linux/kernfs.h                      |  44 +-
 include/linux/kobject_ns.h                  |   7 +-
 include/linux/sysfs.h                       |   8 +-
 include/linux/user_namespace.h              |   3 +
 include/uapi/linux/magic.h                  |   1 +
 kernel/ucount.c                             |   3 +
 lib/kobject.c                               |  17 +-
 lib/kobject_uevent.c                        |   2 +-
 net/core/net-sysfs.c                        |   6 -
 26 files changed, 953 insertions(+), 105 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h


base-commit: ae83d0b416db002fe95601e7f97f64b59514d936
-- 
2.26.1



* [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-22 16:34   ` Greg Kroah-Hartman
  2020-04-22 14:54 ` [PATCH v2 2/7] loopfs: implement loopfs Christian Brauner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Back when I rewrote large chunks of uevent sending I should have removed
the .netlink_ns method completely after having removed its last user in
[1]. Let's remove it now and also remove the helper associated with it
that is unused too.

Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged
---
 Documentation/filesystems/sysfs-tagging.txt |  1 -
 include/linux/kobject_ns.h                  |  3 ---
 lib/kobject.c                               | 13 -------------
 lib/kobject_uevent.c                        |  2 +-
 net/core/net-sysfs.c                        |  6 ------
 5 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
index c7c8e6438958..51d28dd8b84f 100644
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ b/Documentation/filesystems/sysfs-tagging.txt
@@ -37,6 +37,5 @@ Users of this interface:
 - define a type in the kobj_ns_type enumeration.
 - call kobj_ns_type_register() with its kobj_ns_type_operations which has
   - current_ns() which returns current's namespace
-  - netlink_ns() which returns a socket's namespace
   - initial_ns() which returns the initial namesapce
 - call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 069aa2ebef90..991a9286bcea 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -32,7 +32,6 @@ enum kobj_ns_type {
 /*
  * Callbacks so sysfs can determine namespaces
  *   @grab_current_ns: return a new reference to calling task's namespace
- *   @netlink_ns: return namespace to which a sock belongs (right?)
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
  */
@@ -40,7 +39,6 @@ struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
 	bool (*current_may_mount)(void);
 	void *(*grab_current_ns)(void);
-	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
 };
@@ -52,7 +50,6 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
 bool kobj_ns_current_may_mount(enum kobj_ns_type type);
 void *kobj_ns_grab_current(enum kobj_ns_type type);
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
 void kobj_ns_drop(enum kobj_ns_type type, void *ns);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 83198cb37d8d..6f07083cc111 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -1092,19 +1092,6 @@ void *kobj_ns_grab_current(enum kobj_ns_type type)
 }
 EXPORT_SYMBOL_GPL(kobj_ns_grab_current);
 
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk)
-{
-	const void *ns = NULL;
-
-	spin_lock(&kobj_ns_type_lock);
-	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
-	    kobj_ns_ops_tbl[type])
-		ns = kobj_ns_ops_tbl[type]->netlink_ns(sk);
-	spin_unlock(&kobj_ns_type_lock);
-
-	return ns;
-}
-
 const void *kobj_ns_initial(enum kobj_ns_type type)
 {
 	const void *ns = NULL;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 7998affa45d4..a45b3eeaa2b9 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -400,7 +400,7 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 	 * are the only tag relevant here since we want to decide which
 	 * network namespaces to broadcast the uevent into.
 	 */
-	if (ops && ops->netlink_ns && kobj->ktype->namespace)
+	if (ops && kobj->ktype->namespace)
 		if (ops->type == KOBJ_NS_TYPE_NET)
 			net = kobj->ktype->namespace(kobj);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4773ad6ec111..3fa35a3c843a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1694,16 +1694,10 @@ static const void *net_initial_ns(void)
 	return &init_net;
 }
 
-static const void *net_netlink_ns(struct sock *sk)
-{
-	return sock_net(sk);
-}
-
 const struct kobj_ns_type_operations net_ns_type_operations = {
 	.type = KOBJ_NS_TYPE_NET,
 	.current_may_mount = net_current_may_mount,
 	.grab_current_ns = net_grab_current_ns,
-	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
 	.drop_ns = net_drop_ns,
 };
-- 
2.26.1



* [PATCH v2 2/7] loopfs: implement loopfs
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
  2020-04-22 14:54 ` [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-22 21:52   ` Serge E. Hallyn
  2020-04-22 14:54 ` [PATCH v2 3/7] loop: use ns_capable for some loop operations Christian Brauner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs makes it possible to create private loop device instances for
applications for various use-cases. It covers the use-case that was
expressed on-list and in-person to get programmatic access to private
loop devices for image building in sandboxes. An illustration for this
is provided in [4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that existed
before containers were as ubiquitous as they are today. With loopfs users
can call mount -o loop and in a correctly set up container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somewhere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre-allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely set up the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which was also a frequent request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/fd/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances is
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows setting a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need to check not just whether bdev
is valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shut down while backing files might still
be associated with some loop devices. In such cases no bd_disk device
will be attached to bdev. This is not in itself noteworthy; it's more
about documenting the "why" of the added bdev->bd_disk check for
posterity.

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: https://github.com/kubernetes-sigs/kind/issues/1333
     https://github.com/kubernetes-sigs/kind/issues/1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     https://github.com/moby/moby/issues/27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steve Barber <smbarber@google.com>
Cc: Filipe Brandenburger <filbranden@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benjamin Elder <bentheelder@google.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dylan Reid <dgreid@google.com>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Akihiro Suda <suda.kyoto@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- David Rheinsberg <david.rheinsberg@gmail.com> /
  Christian Brauner <christian.brauner@ubuntu.com>:
  - Correctly clean up loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.
---
 MAINTAINERS                    |   5 +
 drivers/block/Kconfig          |   4 +
 drivers/block/Makefile         |   1 +
 drivers/block/loop.c           | 200 ++++++++++---
 drivers/block/loop.h           |  12 +-
 drivers/block/loopfs/Makefile  |   3 +
 drivers/block/loopfs/loopfs.c  | 494 +++++++++++++++++++++++++++++++++
 drivers/block/loopfs/loopfs.h  |  36 +++
 include/linux/user_namespace.h |   3 +
 include/uapi/linux/magic.h     |   1 +
 kernel/ucount.c                |   3 +
 11 files changed, 721 insertions(+), 41 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b816a453b10e..560b37a65bce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9957,6 +9957,11 @@ W:	http://www.avagotech.com/support/
 F:	drivers/message/fusion/
 F:	drivers/scsi/mpt3sas/
 
+LOOPFS FILE SYSTEM
+M:	Christian Brauner <christian.brauner@ubuntu.com>
+S:	Supported
+F:	drivers/block/loopfs/
+
 LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
 M:	Matthew Wilcox <willy@infradead.org>
 L:	linux-scsi@vger.kernel.org
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 025b1b77b11a..d7ff37d795ad 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -214,6 +214,10 @@ config BLK_DEV_LOOP
 
 	  Most users will answer N here.
 
+config BLK_DEV_LOOPFS
+	bool "Loopback device virtual filesystem support"
+	depends on BLK_DEV_LOOP=y
+
 config BLK_DEV_LOOP_MIN_COUNT
 	int "Number of loop devices to pre-create at init time"
 	depends on BLK_DEV_LOOP
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 795facd8cf19..7052be26aa8b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
 obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
 obj-$(CONFIG_BLK_DEV_RBD)     += rbd.o
 obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_ZRAM) += zram/
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..52f7583dd17d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,6 +81,10 @@
 
 #include "loop.h"
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+#include "loopfs/loopfs.h"
+#endif
+
 #include <linux/uaccess.h>
 
 static DEFINE_IDR(loop_index_idr);
@@ -1115,6 +1119,24 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
 	return err;
 }
 
+static void loop_remove(struct loop_device *lo)
+{
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	loopfs_remove(lo);
+#endif
+	del_gendisk(lo->lo_disk);
+	blk_cleanup_queue(lo->lo_queue);
+	blk_mq_free_tag_set(&lo->tag_set);
+	put_disk(lo->lo_disk);
+	kfree(lo);
+}
+
+static inline void __loop_remove(struct loop_device *lo)
+{
+	idr_remove(&loop_index_idr, lo->lo_number);
+	loop_remove(lo);
+}
+
 static int __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp = NULL;
@@ -1164,7 +1186,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);
-	if (bdev) {
+	if (bdev && bdev->bd_disk) {
 		bd_set_size(bdev, 0);
 		/* let user-space know about this change */
 		kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
@@ -1174,7 +1196,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
-	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk;
 	lo_number = lo->lo_number;
 	loop_unprepare_queue(lo);
 out_unlock:
@@ -1213,7 +1235,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-	lo->lo_state = Lo_unbound;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (loopfs_wants_remove(lo))
+		__loop_remove(lo);
+	else
+#endif
+		lo->lo_state = Lo_unbound;
 	mutex_unlock(&loop_ctl_mutex);
 
 	/*
@@ -1259,6 +1286,74 @@ static int loop_clr_fd(struct loop_device *lo)
 	return __loop_clr_fd(lo, false);
 }
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+int loopfs_rundown_locked(struct loop_device *lo)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(!loopfs_device(lo)))
+		return -EINVAL;
+
+	ret = mutex_lock_killable(&loop_ctl_mutex);
+	if (ret)
+		return ret;
+
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) {
+		ret = -EBUSY;
+	} else {
+		/*
+		 * Since the device is unbound it has no associated backing
+		 * file and we can safely set Lo_rundown to prevent it from
+		 * being found. Actual cleanup happens during inode eviction.
+		 */
+		lo->lo_state = Lo_rundown;
+		ret = 0;
+	}
+
+	mutex_unlock(&loop_ctl_mutex);
+	return ret;
+}
+
+/**
+ * loopfs_evict_locked() - remove loop device or mark inactive
+ * @lo:	loopfs loop device
+ *
+ * This function will remove a loop device. If it has no users
+ * and is bound the backing file will be cleaned up. If the loop
+ * device has users it will be marked for auto cleanup.
+ * This function is only called when a loopfs instance is shutdown
+ * when all references to it from this loopfs instance have been
+ * dropped. If there are still any references to it cleanup will
+ * happen in lo_release().
+ */
+void loopfs_evict_locked(struct loop_device *lo)
+{
+	struct lo_loopfs *lo_info;
+	struct inode *lo_inode;
+
+	WARN_ON_ONCE(!loopfs_device(lo));
+
+	mutex_lock(&loop_ctl_mutex);
+	lo_info = lo->lo_info;
+	lo_inode = lo_info->lo_inode;
+	lo_info->lo_inode = NULL;
+	lo_info->lo_flags |= LOOPFS_FLAGS_INACTIVE;
+
+	if (atomic_read(&lo->lo_refcnt) > 0) {
+		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+	} else {
+		lo->lo_state = Lo_rundown;
+		lo->lo_disk->private_data = NULL;
+		lo_inode->i_private = NULL;
+
+		mutex_unlock(&loop_ctl_mutex);
+		__loop_clr_fd(lo, false);
+		return;
+	}
+	mutex_unlock(&loop_ctl_mutex);
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1842,7 +1937,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
 		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
+			goto out_remove;
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&loop_ctl_mutex);
 		/*
@@ -1860,6 +1955,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
+out_remove:
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (lo->lo_state != Lo_bound && loopfs_wants_remove(lo))
+		__loop_remove(lo);
+#endif
+
 out_unlock:
 	mutex_unlock(&loop_ctl_mutex);
 }
@@ -1878,6 +1979,11 @@ static const struct block_device_operations lo_fops = {
  * And now the modules code and kernel interface.
  */
 static int max_loop;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+unsigned long max_devices;
+#else
+static unsigned long max_devices;
+#endif
 module_param(max_loop, int, 0444);
 MODULE_PARM_DESC(max_loop, "Maximum number of loop devices");
 module_param(max_part, int, 0444);
@@ -2006,7 +2112,7 @@ static const struct blk_mq_ops loop_mq_ops = {
 	.complete	= lo_complete_rq,
 };
 
-static int loop_add(struct loop_device **l, int i)
+static int loop_add(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	struct gendisk *disk;
@@ -2096,7 +2202,17 @@ static int loop_add(struct loop_device **l, int i)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+
 	add_disk(disk);
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	err = loopfs_add(lo, inode, disk_devt(disk));
+	if (err) {
+		__loop_remove(lo);
+		goto out;
+	}
+#endif
+
 	*l = lo;
 	return lo->lo_number;
 
@@ -2112,36 +2228,41 @@ static int loop_add(struct loop_device **l, int i)
 	return err;
 }
 
-static void loop_remove(struct loop_device *lo)
-{
-	del_gendisk(lo->lo_disk);
-	blk_cleanup_queue(lo->lo_queue);
-	blk_mq_free_tag_set(&lo->tag_set);
-	put_disk(lo->lo_disk);
-	kfree(lo);
-}
+struct find_free_cb_data {
+	struct loop_device **l;
+	struct inode *inode;
+};
 
 static int find_free_cb(int id, void *ptr, void *data)
 {
 	struct loop_device *lo = ptr;
-	struct loop_device **l = data;
+	struct find_free_cb_data *cb_data = data;
 
-	if (lo->lo_state == Lo_unbound) {
-		*l = lo;
-		return 1;
-	}
-	return 0;
+	if (lo->lo_state != Lo_unbound)
+		return 0;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (!loopfs_access(cb_data->inode, lo))
+		return 0;
+#endif
+
+	*cb_data->l = lo;
+	return 1;
 }
 
-static int loop_lookup(struct loop_device **l, int i)
+static int loop_lookup(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	int ret = -ENODEV;
 
 	if (i < 0) {
 		int err;
+		struct find_free_cb_data cb_data = {
+			.l = &lo,
+			.inode = inode,
+		};
 
-		err = idr_for_each(&loop_index_idr, &find_free_cb, &lo);
+		err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data);
 		if (err == 1) {
 			*l = lo;
 			ret = lo->lo_number;
@@ -2152,6 +2273,11 @@ static int loop_lookup(struct loop_device **l, int i)
 	/* lookup and return a specific i */
 	lo = idr_find(&loop_index_idr, i);
 	if (lo) {
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		if (!loopfs_access(inode, lo))
+			return -EACCES;
+#endif
+
 		*l = lo;
 		ret = lo->lo_number;
 	}
@@ -2166,9 +2292,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
 	int err;
 
 	mutex_lock(&loop_ctl_mutex);
-	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
+	err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
-		err = loop_add(&lo, MINOR(dev) >> part_shift);
+		err = loop_add(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
 		kobj = NULL;
 	else
@@ -2192,15 +2318,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 	ret = -ENOSYS;
 	switch (cmd) {
 	case LOOP_CTL_ADD:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret >= 0) {
 			ret = -EEXIST;
 			break;
 		}
-		ret = loop_add(&lo, parm);
+		ret = loop_add(&lo, parm, file_inode(file));
 		break;
 	case LOOP_CTL_REMOVE:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret < 0)
 			break;
 		if (lo->lo_state != Lo_unbound) {
@@ -2212,14 +2338,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 		lo->lo_disk->private_data = NULL;
-		idr_remove(&loop_index_idr, lo->lo_number);
-		loop_remove(lo);
+		__loop_remove(lo);
 		break;
 	case LOOP_CTL_GET_FREE:
-		ret = loop_lookup(&lo, -1);
+		ret = loop_lookup(&lo, -1, file_inode(file));
 		if (ret >= 0)
 			break;
-		ret = loop_add(&lo, -1);
+		ret = loop_add(&lo, -1, file_inode(file));
 	}
 	mutex_unlock(&loop_ctl_mutex);
 
@@ -2246,7 +2371,6 @@ MODULE_ALIAS("devname:loop-control");
 static int __init loop_init(void)
 {
 	int i, nr;
-	unsigned long range;
 	struct loop_device *lo;
 	int err;
 
@@ -2285,10 +2409,10 @@ static int __init loop_init(void)
 	 */
 	if (max_loop) {
 		nr = max_loop;
-		range = max_loop << part_shift;
+		max_devices = max_loop << part_shift;
 	} else {
 		nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT;
-		range = 1UL << MINORBITS;
+		max_devices = 1UL << MINORBITS;
 	}
 
 	err = misc_register(&loop_misc);
@@ -2301,13 +2425,13 @@ static int __init loop_init(void)
 		goto misc_out;
 	}
 
-	blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
+	blk_register_region(MKDEV(LOOP_MAJOR, 0), max_devices,
 				  THIS_MODULE, loop_probe, NULL, NULL);
 
 	/* pre-create number of devices given by config or max_loop */
 	mutex_lock(&loop_ctl_mutex);
 	for (i = 0; i < nr; i++)
-		loop_add(&lo, i);
+		loop_add(&lo, i, NULL);
 	mutex_unlock(&loop_ctl_mutex);
 
 	printk(KERN_INFO "loop: module loaded\n");
@@ -2329,14 +2453,10 @@ static int loop_exit_cb(int id, void *ptr, void *data)
 
 static void __exit loop_exit(void)
 {
-	unsigned long range;
-
-	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
-
 	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
 	idr_destroy(&loop_index_idr);
 
-	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
+	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), max_devices);
 	unregister_blkdev(LOOP_MAJOR, "loop");
 
 	misc_deregister(&loop_misc);
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index af75a5ee4094..6fed746b6124 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -17,6 +17,10 @@
 #include <linux/kthread.h>
 #include <uapi/linux/loop.h>
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+#include "loopfs/loopfs.h"
+#endif
+
 /* Possible states of device */
 enum {
 	Lo_unbound,
@@ -62,6 +66,9 @@ struct loop_device {
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct lo_loopfs	*lo_info;
+#endif
 };
 
 struct loop_cmd {
@@ -89,6 +96,9 @@ struct loop_func_table {
 }; 
 
 int loop_register_transfer(struct loop_func_table *funcs);
-int loop_unregister_transfer(int number); 
+int loop_unregister_transfer(int number);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+extern unsigned long max_devices;
+#endif
 
 #endif
diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile
new file mode 100644
index 000000000000..87ec703b662e
--- /dev/null
+++ b/drivers/block/loopfs/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+loopfs-y			:= loopfs.o
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs.o
diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
new file mode 100644
index 000000000000..b3461c72b6e7
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.c
@@ -0,0 +1,494 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/fsnotify.h>
+#include <linux/genhd.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/magic.h>
+#include <linux/major.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include "../loop.h"
+#include "loopfs.h"
+
+#define FIRST_INODE 1
+#define SECOND_INODE 2
+#define INODE_OFFSET 3
+
+enum loopfs_param {
+	Opt_max,
+};
+
+const struct fs_parameter_spec loopfs_fs_parameters[] = {
+	fsparam_u32("max",	Opt_max),
+	{}
+};
+
+struct loopfs_mount_opts {
+	int max;
+};
+
+struct loopfs_info {
+	kuid_t root_uid;
+	kgid_t root_gid;
+	unsigned long device_count;
+	struct dentry *control_dentry;
+	struct loopfs_mount_opts mount_opts;
+};
+
+static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+struct super_block *loopfs_i_sb(const struct inode *inode)
+{
+	if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC)
+		return inode->i_sb;
+
+	return NULL;
+}
+
+bool loopfs_device(const struct loop_device *lo)
+{
+	return lo->lo_info != NULL;
+}
+
+struct user_namespace *loopfs_ns(const struct loop_device *lo)
+{
+	if (loopfs_device(lo)) {
+		struct super_block *sb;
+
+		sb = loopfs_i_sb(lo->lo_info->lo_inode);
+		if (sb)
+			return sb->s_user_ns;
+	}
+
+	return &init_user_ns;
+}
+
+bool loopfs_access(const struct inode *first, struct loop_device *lo)
+{
+	return loopfs_device(lo) &&
+	       loopfs_i_sb(first) == loopfs_i_sb(lo->lo_info->lo_inode);
+}
+
+bool loopfs_wants_remove(const struct loop_device *lo)
+{
+	return lo->lo_info && (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE);
+}
+
+/**
+ * loopfs_add - allocate inode from super block of a loopfs mount
+ * @lo:		loop device for which we are creating a new device entry
+ * @ref_inode:	inode from which the super block will be taken
+ * @device_nr:  device number of the associated disk device
+ *
+ * This function creates a new device node for @lo.
+ * Minor numbers are limited and tracked globally. The
+ * function will stash a struct loop_device for the specific loop
+ * device in i_private of the inode.
+ * It will go on to allocate a new inode from the super block of the
+ * filesystem mount, stash a struct loop_device in its i_private field
+ * and attach a dentry to that inode.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int loopfs_add(struct loop_device *lo, struct inode *ref_inode, dev_t device_nr)
+{
+	int ret;
+	char name[DISK_NAME_LEN];
+	struct super_block *sb;
+	struct loopfs_info *info;
+	struct dentry *root, *dentry;
+	struct inode *inode;
+	struct lo_loopfs *lo_info;
+
+	sb = loopfs_i_sb(ref_inode);
+	if (!sb)
+		return 0;
+
+	if (MAJOR(device_nr) != LOOP_MAJOR)
+		return -EINVAL;
+
+	lo_info = kzalloc(sizeof(struct lo_loopfs), GFP_KERNEL);
+	if (!lo_info) {
+		/* Nothing to undo yet, and the err path dereferences lo_info. */
+		return -ENOMEM;
+	}
+
+	info = LOOPFS_SB(sb);
+	if ((info->device_count + 1) > info->mount_opts.max) {
+		ret = -ENOSPC;
+		goto err;
+	}
+
+	lo_info->lo_ucount = inc_ucount(sb->s_user_ns,
+					info->root_uid, UCOUNT_LOOP_DEVICES);
+	if (!lo_info->lo_ucount) {
+		ret = -ENOSPC;
+		goto err;
+	}
+
+	if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	inode = new_inode(sb);
+	if (!inode) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop device in this loopfs instance is opened.
+	 */
+	inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+	init_special_inode(inode, S_IFBLK | 0600, device_nr);
+
+	root = sb->s_root;
+	inode_lock(d_inode(root));
+	/* look it up */
+	dentry = lookup_one_len(name, root, strlen(name));
+	if (IS_ERR(dentry)) {
+		inode_unlock(d_inode(root));
+		iput(inode);
+		ret = PTR_ERR(dentry);
+		goto err;
+	}
+
+	if (d_really_is_positive(dentry)) {
+		/* already exists */
+		dput(dentry);
+		inode_unlock(d_inode(root));
+		iput(inode);
+		ret = -EEXIST;
+		goto err;
+	}
+
+	d_instantiate(dentry, inode);
+	fsnotify_create(d_inode(root), dentry);
+	inode_unlock(d_inode(root));
+
+	lo_info->lo_inode = inode;
+	lo->lo_info = lo_info;
+	inode->i_private = lo;
+	info->device_count++;
+
+	return 0;
+
+err:
+	if (lo_info->lo_ucount)
+		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
+	kfree(lo_info);
+	return ret;
+}
+
+void loopfs_remove(struct loop_device *lo)
+{
+	struct lo_loopfs *lo_info = lo->lo_info;
+	struct inode *inode;
+	struct super_block *sb;
+	struct dentry *root, *dentry;
+
+	if (!lo_info)
+		return;
+
+	inode = lo_info->lo_inode;
+	if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR)
+		goto out;
+
+	sb = loopfs_i_sb(inode);
+	lo_info->lo_inode = NULL;
+
+	/*
+	 * The root dentry is always the parent dentry since we don't allow
+	 * creation of directories.
+	 */
+	root = sb->s_root;
+
+	inode_lock(d_inode(root));
+	dentry = d_find_any_alias(inode);
+	if (dentry && simple_positive(dentry)) {
+		simple_unlink(d_inode(root), dentry);
+		d_delete(dentry);
+	}
+	dput(dentry);
+	inode_unlock(d_inode(root));
+	LOOPFS_SB(sb)->device_count--;
+
+out:
+	if (lo_info->lo_ucount)
+		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
+	kfree(lo->lo_info);
+	lo->lo_info = NULL;
+}
+
+static void loopfs_fs_context_free(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+
+	kfree(ctx);
+}
+
+/**
+ * loopfs_loop_ctl_create - create a new loop-control device
+ * @sb: super block of the loopfs mount
+ *
+ * This function creates a new loop-control device node in the loopfs mount
+ * referred to by @sb.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static int loopfs_loop_ctl_create(struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct inode *inode = NULL;
+	struct dentry *root = sb->s_root;
+	struct loopfs_info *info = sb->s_fs_info;
+
+	if (info->control_dentry)
+		return 0;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = SECOND_INODE;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	init_special_inode(inode, S_IFCHR | 0600,
+			   MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR));
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop-control device in this loopfs instance is opened.
+	 */
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+
+	dentry = d_alloc_name(root, "loop-control");
+	if (!dentry) {
+		iput(inode);
+		return -ENOMEM;
+	}
+
+	info->control_dentry = dentry;
+	d_add(dentry, inode);
+
+	return 0;
+}
+
+static inline bool is_loopfs_control_device(const struct dentry *dentry)
+{
+	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
+}
+
+static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			 struct inode *new_dir, struct dentry *new_dentry,
+			 unsigned int flags)
+{
+	if (is_loopfs_control_device(old_dentry) ||
+	    is_loopfs_control_device(new_dentry))
+		return -EPERM;
+
+	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
+}
+
+static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int ret;
+	struct loop_device *lo;
+
+	if (is_loopfs_control_device(dentry))
+		return -EPERM;
+
+	lo = d_inode(dentry)->i_private;
+	ret = loopfs_rundown_locked(lo);
+	if (ret)
+		return ret;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations loopfs_dir_inode_operations = {
+	.lookup = simple_lookup,
+	.rename = loopfs_rename,
+	.unlink = loopfs_unlink,
+};
+
+static void loopfs_evict_inode(struct inode *inode)
+{
+	struct loop_device *lo = inode->i_private;
+
+	clear_inode(inode);
+
+	if (lo && S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
+		loopfs_evict_locked(lo);
+		LOOPFS_SB(inode->i_sb)->device_count--;
+		inode->i_private = NULL;
+	}
+}
+
+static int loopfs_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct loopfs_info *info = LOOPFS_SB(root->d_sb);
+
+	if (info->mount_opts.max <= max_devices)
+		seq_printf(seq, ",max=%d", info->mount_opts.max);
+
+	return 0;
+}
+
+static void loopfs_put_super(struct super_block *sb)
+{
+	struct loopfs_info *info = sb->s_fs_info;
+
+	sb->s_fs_info = NULL;
+	kfree(info);
+}
+
+static const struct super_operations loopfs_super_ops = {
+	.evict_inode    = loopfs_evict_inode,
+	.show_options	= loopfs_show_options,
+	.statfs         = simple_statfs,
+	.put_super	= loopfs_put_super,
+};
+
+static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct loopfs_info *info;
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct inode *inode = NULL;
+
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+
+	sb->s_iflags &= ~SB_I_NODEV;
+	sb->s_iflags |= SB_I_NOEXEC;
+	sb->s_magic = LOOPFS_SUPER_MAGIC;
+	sb->s_op = &loopfs_super_ops;
+	sb->s_time_gran = 1;
+
+	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
+	if (!sb->s_fs_info)
+		return -ENOMEM;
+	info = sb->s_fs_info;
+
+	info->root_gid = make_kgid(sb->s_user_ns, 0);
+	if (!gid_valid(info->root_gid))
+		info->root_gid = GLOBAL_ROOT_GID;
+	info->root_uid = make_kuid(sb->s_user_ns, 0);
+	if (!uid_valid(info->root_uid))
+		info->root_uid = GLOBAL_ROOT_UID;
+	info->mount_opts.max = ctx->max;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = FIRST_INODE;
+	inode->i_fop = &simple_dir_operations;
+	inode->i_mode = S_IFDIR | 0755;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_op = &loopfs_dir_inode_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return loopfs_loop_ctl_create(sb);
+}
+
+static int loopfs_fs_context_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, loopfs_fill_super);
+}
+
+static int loopfs_fs_context_parse_param(struct fs_context *fc,
+					 struct fs_parameter *param)
+{
+	int opt;
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct fs_parse_result result;
+
+	opt = fs_parse(fc, loopfs_fs_parameters, param, &result);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_max:
+		if (result.uint_32 > max_devices)
+			return invalfc(fc, "Bad value for '%s'", param->key);
+
+		ctx->max = result.uint_32;
+		break;
+	default:
+		return invalfc(fc, "Unsupported parameter '%s'", param->key);
+	}
+
+	return 0;
+}
+
+static int loopfs_fs_context_reconfigure(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+	struct loopfs_info *info = LOOPFS_SB(fc->root->d_sb);
+
+	info->mount_opts.max = ctx->max;
+	return 0;
+}
+
+static const struct fs_context_operations loopfs_fs_context_ops = {
+	.free		= loopfs_fs_context_free,
+	.get_tree	= loopfs_fs_context_get_tree,
+	.parse_param	= loopfs_fs_context_parse_param,
+	.reconfigure	= loopfs_fs_context_reconfigure,
+};
+
+static int loopfs_init_fs_context(struct fs_context *fc)
+{
+	struct loopfs_mount_opts *ctx = fc->fs_private;
+
+	ctx = kzalloc(sizeof(struct loopfs_mount_opts), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->max = max_devices;
+
+	fc->fs_private = ctx;
+
+	fc->ops = &loopfs_fs_context_ops;
+
+	return 0;
+}
+
+static struct file_system_type loop_fs_type = {
+	.name			= "loop",
+	.init_fs_context	= loopfs_init_fs_context,
+	.parameters		= loopfs_fs_parameters,
+	.kill_sb		= kill_litter_super,
+	.fs_flags		= FS_USERNS_MOUNT,
+};
+
+int __init init_loopfs(void)
+{
+	init_user_ns.ucount_max[UCOUNT_LOOP_DEVICES] = 255;
+	return register_filesystem(&loop_fs_type);
+}
+
+module_init(init_loopfs);
+MODULE_AUTHOR("Christian Brauner <christian.brauner@ubuntu.com>");
+MODULE_DESCRIPTION("Loop device filesystem");
diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
new file mode 100644
index 000000000000..2ee114aa3fa9
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_LOOPFS_FS_H
+#define _LINUX_LOOPFS_FS_H
+
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/user_namespace.h>
+
+struct loop_device;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+
+#define LOOPFS_FLAGS_INACTIVE (1 << 0)
+
+struct lo_loopfs {
+	struct ucounts *lo_ucount;
+	struct inode *lo_inode;
+	int lo_flags;
+};
+
+extern struct super_block *loopfs_i_sb(const struct inode *inode);
+extern bool loopfs_device(const struct loop_device *lo);
+extern struct user_namespace *loopfs_ns(const struct loop_device *lo);
+extern bool loopfs_access(const struct inode *first, struct loop_device *lo);
+extern int loopfs_add(struct loop_device *lo, struct inode *ref_inode,
+		      dev_t device_nr);
+extern void loopfs_remove(struct loop_device *lo);
+extern bool loopfs_wants_remove(const struct loop_device *lo);
+extern void loopfs_evict_locked(struct loop_device *lo);
+extern int loopfs_rundown_locked(struct loop_device *lo);
+
+#endif
+
+#endif /* _LINUX_LOOPFS_FS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..04a4891765c0 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -49,6 +49,9 @@ enum ucount_type {
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_INOTIFY_INSTANCES,
 	UCOUNT_INOTIFY_WATCHES,
+#endif
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	UCOUNT_LOOP_DEVICES,
 #endif
 	UCOUNT_COUNTS,
 };
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..0817d093a012 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,6 +75,7 @@
 #define BINFMTFS_MAGIC          0x42494e4d
 #define DEVPTS_SUPER_MAGIC	0x1cd1
 #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
+#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define PIPEFS_MAGIC            0x50495045
 #define PROC_SUPER_MAGIC	0x9fa0
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..fb0f6394a8bb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -73,6 +73,9 @@ static struct ctl_table user_table[] = {
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_ENTRY("max_inotify_instances"),
 	UCOUNT_ENTRY("max_inotify_watches"),
+#endif
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	UCOUNT_ENTRY("max_loop_devices"),
 #endif
 	{ }
 };
-- 
2.26.1


* [PATCH v2 3/7] loop: use ns_capable for some loop operations
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
  2020-04-22 14:54 ` [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
  2020-04-22 14:54 ` [PATCH v2 2/7] loopfs: implement loopfs Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-22 20:50   ` Serge E. Hallyn
  2020-04-22 14:54 ` [PATCH v2 4/7] kernfs: handle multiple namespace tags Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

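The LOOP_GET_STATUS, LOOP_SET_STATUS, and LOOP_SET_BLOCK_SIZE operations
are now allowed in non-initial namespaces: their CAP_SYS_ADMIN checks use
ns_capable() against the user namespace of the loopfs instance instead of
capable(). Most other operations were already possible before.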
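
As a rough illustration of the intended semantics, here is a small
userspace model. It is only a sketch: the struct names and the model_*()
helpers are made up for this example, and the owner-of-child-namespace
rule that the real ns_capable() also honors is deliberately left out. A
device without a loopfs instance falls back to the initial namespace, as
loopfs_ns() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial namespace */
};

struct task {
	const struct user_ns *ns;	/* namespace holding the task's creds */
	bool cap_sys_admin;		/* CAP_SYS_ADMIN held in that namespace? */
};

struct loop_dev {
	const struct user_ns *loopfs_ns; /* NULL if not created via loopfs */
};

static const struct user_ns init_ns = { .parent = NULL };

/* Same fallback as loopfs_ns(): non-loopfs devices map to init_ns. */
static const struct user_ns *dev_ns(const struct loop_dev *lo)
{
	return lo->loopfs_ns ? lo->loopfs_ns : &init_ns;
}

/*
 * Simplified ns_capable(): the capability counts if the task holds it in
 * the target namespace or in one of the target's ancestors.
 */
static bool model_ns_capable(const struct task *t, const struct user_ns *target)
{
	const struct user_ns *ns;

	assert(t && target);
	if (!t->cap_sys_admin)
		return false;
	for (ns = target; ns; ns = ns->parent)
		if (ns == t->ns)
			return true;
	return false;
}

/* Model of loop_capable() with CONFIG_BLK_DEV_LOOPFS enabled. */
static bool model_loop_capable(const struct task *t, const struct loop_dev *lo)
{
	return model_ns_capable(t, dev_ns(lo));
}
```

In this model a container root passes the check for a device from its own
loopfs instance but not for a host loop device, while host root passes
for both.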

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Adapted loop_capable() based on changes in the loopfs
    implementation patchset. Otherwise it is functionally equivalent to
    the v1 version.
---
 drivers/block/loop.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 52f7583dd17d..8e21d4b33e01 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1352,6 +1352,16 @@ void loopfs_evict_locked(struct loop_device *lo)
 	}
 	mutex_unlock(&loop_ctl_mutex);
 }
+
+static bool loop_capable(const struct loop_device *lo, int cap)
+{
+	return ns_capable(loopfs_ns(lo), cap);
+}
+#else /* !CONFIG_BLK_DEV_LOOPFS */
+static inline bool loop_capable(const struct loop_device *lo, int cap)
+{
+	return capable(cap);
+}
 #endif /* CONFIG_BLK_DEV_LOOPFS */
 
 static int
@@ -1368,7 +1378,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return err;
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
-	    !capable(CAP_SYS_ADMIN)) {
+	    !loop_capable(lo, CAP_SYS_ADMIN)) {
 		err = -EPERM;
 		goto out_unlock;
 	}
@@ -1499,7 +1509,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
 	info->lo_encrypt_type =
 		lo->lo_encryption ? lo->lo_encryption->number : 0;
-	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+	if (lo->lo_encrypt_key_size && loop_capable(lo, CAP_SYS_ADMIN)) {
 		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
 		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
 		       lo->lo_encrypt_key_size);
@@ -1723,7 +1733,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_clr_fd(lo);
 	case LOOP_SET_STATUS:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status_old(lo,
 					(struct loop_info __user *)arg);
 		}
@@ -1732,7 +1742,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_get_status_old(lo, (struct loop_info __user *) arg);
 	case LOOP_SET_STATUS64:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status64(lo,
 					(struct loop_info64 __user *) arg);
 		}
@@ -1742,7 +1752,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 	case LOOP_SET_CAPACITY:
 	case LOOP_SET_DIRECT_IO:
 	case LOOP_SET_BLOCK_SIZE:
-		if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
+		if (!(mode & FMODE_WRITE) && !loop_capable(lo, CAP_SYS_ADMIN))
 			return -EPERM;
 		/* Fall through */
 	default:
-- 
2.26.1


* [PATCH v2 4/7] kernfs: handle multiple namespace tags
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
                   ` (2 preceding siblings ...)
  2020-04-22 14:54 ` [PATCH v2 3/7] loop: use ns_capable for some loop operations Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-22 22:01   ` Serge E. Hallyn
  2020-04-22 14:54 ` [PATCH v2 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Since [1] kernfs supports namespace tags. This feature is essential to
enable sysfs to present different views of various parts depending on
the namespace tag. For example, the /sys/class/net/ directory will only
show network devices that belong to the network namespace that sysfs was
mounted in. This is achieved by stashing a reference to the network
namespace of the task mounting sysfs in the super block. And when a
lookup operation is performed on e.g. /sys/class/net/, kernfs will
compare the network namespace tag of the kernfs_node associated with the
device and kobject of the network device to the network namespace
stashed in the super block. This ensures that only network devices owned by the
network namespace sysfs was mounted in are shown, a feature which is
essential to containers.
For loopfs to show correct permissions in sysfs just as with network
devices we need to be able to tag kernfs_super_info with additional
namespaces. This extension was even already mentioned in a comment to
struct kernfs_super_info:
  /*
   * Each sb is associated with one namespace tag, currently the
   * network namespace of the task which mounted this kernfs
   * instance.  If multiple tags become necessary, make the following
   * an array and compare kernfs_node tag against every entry.
   */
This patch extends the kernfs_super_info and kernfs_fs_context ns
pointers to fixed-size arrays of namespace tags. The size is taken from
the namespaces currently supported by kobjects, i.e. we don't extend it
to cover all namespaces but only the ones kernfs needs to support.
In addition, the kernfs_node struct gains an additional member that
indicates the type of namespace this kernfs_node was tagged with. This
allows us to simply retrieve the correct namespace tag from the
kernfs_fs_context and kernfs_super_info ns array with a simple indexing
operation. This has the advantage that we can just keep passing down the
correct namespace instead of passing down the array.
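
To make the indexing idea concrete, here is a minimal userspace sketch.
The types and node_visible() are hypothetical stand-ins modelled on the
names in this patch, NS_TYPE_USER is an assumed second tag type added
only for illustration, and the real code checks kernfs_ns_enabled() on
the parent rather than the node's own type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum kobj_ns_type. */
enum ns_type {
	NS_TYPE_NONE = 0,
	NS_TYPE_NET,
	NS_TYPE_USER,	/* assumed extra type, for illustration only */
	NS_TYPES
};

/* Stand-in for kernfs_super_info: one tag slot per namespace type. */
struct super_info {
	const void *ns[NS_TYPES];
};

/* Stand-in for kernfs_node: the tag plus the type it belongs to. */
struct node {
	const void *ns;
	enum ns_type ns_type;
};

/*
 * A node is visible in a mount when its tag matches the mount's tag of
 * the same type. The type selects the slot, so callers only ever pass
 * a single tag around instead of the whole array.
 */
static bool node_visible(const struct super_info *sb, const struct node *kn)
{
	assert(kn->ns_type < NS_TYPES);
	if (kn->ns_type == NS_TYPE_NONE)
		return true;	/* untagged nodes show up in every mount */
	return sb->ns[kn->ns_type] == kn->ns;
}
```

A mount tagged with one network namespace and one user namespace then
shows nodes carrying either of those tags and hides nodes tagged with a
different namespace of the same type.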

[1]: 608b4b9548de ("netns: Teach network device kobjects which namespace they are in.")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged
---
 fs/kernfs/dir.c             |  6 +++---
 fs/kernfs/kernfs-internal.h |  9 ++++-----
 fs/kernfs/mount.c           | 11 +++++++----
 fs/sysfs/mount.c            | 10 +++++-----
 include/linux/kernfs.h      | 22 ++++++++++++++--------
 include/linux/sysfs.h       |  8 +++++---
 lib/kobject.c               |  2 +-
 7 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..1f2d894ae454 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -576,7 +576,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 
 	/* The kernfs node has been moved to a different namespace */
 	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns != kn->ns)
+	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
 		goto out_bad;
 
 	mutex_unlock(&kernfs_mutex);
@@ -1087,7 +1087,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dir->i_sb)->ns;
+		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
 
@@ -1673,7 +1673,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dentry->d_sb)->ns;
+		ns = kernfs_info(dentry->d_sb)->ns[parent->ns_type];
 
 	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
 	     pos;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..7c972c00f84a 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -16,6 +16,7 @@
 #include <linux/xattr.h>
 
 #include <linux/kernfs.h>
+#include <linux/kobject_ns.h>
 #include <linux/fs_context.h>
 
 struct kernfs_iattrs {
@@ -62,12 +63,10 @@ struct kernfs_super_info {
 	struct kernfs_root	*root;
 
 	/*
-	 * Each sb is associated with one namespace tag, currently the
-	 * network namespace of the task which mounted this kernfs
-	 * instance.  If multiple tags become necessary, make the following
-	 * an array and compare kernfs_node tag against every entry.
+	 * Each sb can be associated with namespace tags. They will be used
+	 * to compare kernfs_node tags against relevant entries.
 	 */
-	const void		*ns;
+	const void		*ns[KOBJ_NS_TYPES];
 
 	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
 	struct list_head	node;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10..dc4ee0f0a597 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -279,14 +279,15 @@ static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
 	struct kernfs_super_info *sb_info = kernfs_info(sb);
 	struct kernfs_super_info *info = fc->s_fs_info;
 
-	return sb_info->root == info->root && sb_info->ns == info->ns;
+	return sb_info->root == info->root &&
+	       memcmp(sb_info->ns, info->ns, sizeof(sb_info->ns)) == 0;
 }
 
 static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	kfc->ns_tag = NULL;
+	memset(kfc->ns_tag, 0, sizeof(kfc->ns_tag));
 	return set_anon_super_fc(sb, fc);
 }
 
@@ -296,7 +297,7 @@ static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
  *
  * Return the namespace tag associated with kernfs super_block @sb.
  */
-const void *kernfs_super_ns(struct super_block *sb)
+const void **kernfs_super_ns(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 
@@ -324,7 +325,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		return -ENOMEM;
 
 	info->root = kfc->root;
-	info->ns = kfc->ns_tag;
+	BUILD_BUG_ON(sizeof(info->ns) != sizeof(kfc->ns_tag));
+	memcpy(info->ns, kfc->ns_tag, sizeof(info->ns));
+
 	INIT_LIST_HEAD(&info->node);
 
 	fc->s_fs_info = info;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index db81cfbab9d6..5e2ec88a709e 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -41,8 +41,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	if (kfc->ns_tag)
-		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
+		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -66,7 +66,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 	if (!kfc)
 		return -ENOMEM;
 
-	kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -81,10 +81,10 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 
 static void sysfs_kill_sb(struct super_block *sb)
 {
-	void *ns = (void *)kernfs_super_ns(sb);
+	void **ns = (void **)kernfs_super_ns(sb);
 
 	kernfs_kill_sb(sb);
-	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
+	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
 }
 
 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 89f6a4214a70..d0544f2e0c99 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
 #include <linux/atomic.h>
 #include <linux/uidgid.h>
 #include <linux/wait.h>
+#include <linux/kobject_ns.h>
 
 struct file;
 struct dentry;
@@ -137,8 +138,9 @@ struct kernfs_node {
 
 	struct rb_node		rb;
 
-	const void		*ns;	/* namespace tag */
-	unsigned int		hash;	/* ns + name hash */
+	const void		*ns;		/* namespace tag */
+	enum kobj_ns_type	ns_type;	/* type of namespace tag */
+	unsigned int		hash;		/* ns + name hash */
 	union {
 		struct kernfs_elem_dir		dir;
 		struct kernfs_elem_symlink	symlink;
@@ -275,7 +277,7 @@ struct kernfs_ops {
  */
 struct kernfs_fs_context {
 	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
-	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
+	void			*ns_tag[KOBJ_NS_TYPES]; /* Namespace tags of the mount (or empty) */
 	unsigned long		magic;		/* File system specific magic number */
 
 	/* The following are set/used by kernfs_mount() */
@@ -319,17 +321,20 @@ static inline ino_t kernfs_gen(struct kernfs_node *kn)
 
 /**
  * kernfs_enable_ns - enable namespace under a directory
- * @kn: directory of interest, should be empty
+ * @kn:		directory of interest, should be empty
+ * @ns_type:	type of namespace that should be enabled for this directory
  *
  * This is to be called right after @kn is created to enable namespace
  * under it.  All children of @kn must have non-NULL namespace tags and
  * only the ones which match the super_block's tag will be visible.
  */
-static inline void kernfs_enable_ns(struct kernfs_node *kn)
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type)
 {
 	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
 	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
 	kn->flags |= KERNFS_NS;
+	kn->ns_type = ns_type;
 }
 
 /**
@@ -401,7 +406,7 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 		     const void *value, size_t size, int flags);
 
-const void *kernfs_super_ns(struct super_block *sb);
+const void **kernfs_super_ns(struct super_block *sb);
 int kernfs_get_tree(struct fs_context *fc);
 void kernfs_free_fs_context(struct fs_context *fc);
 void kernfs_kill_sb(struct super_block *sb);
@@ -415,7 +420,8 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
 { return 0; }	/* whatever */
 
-static inline void kernfs_enable_ns(struct kernfs_node *kn) { }
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type) { }
 
 static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 { return false; }
@@ -511,7 +517,7 @@ static inline int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 				   const void *value, size_t size, int flags)
 { return -ENOSYS; }
 
-static inline const void *kernfs_super_ns(struct super_block *sb)
+static inline const void **kernfs_super_ns(struct super_block *sb)
 { return NULL; }
 
 static inline int kernfs_get_tree(struct fs_context *fc)
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 80bb865b3a33..d127b3487abc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -306,9 +306,10 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
 
 int __must_check sysfs_init(void);
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
-	return kernfs_enable_ns(kn);
+	return kernfs_enable_ns(kn, ns_type);
 }
 
 int sysfs_file_change_owner(struct kobject *kobj, const char *name, kuid_t kuid,
@@ -531,7 +532,8 @@ static inline int __must_check sysfs_init(void)
 	return 0;
 }
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
 }
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 6f07083cc111..c58c62d49a10 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -120,7 +120,7 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(ops->type >= KOBJ_NS_TYPES);
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
-		sysfs_enable_ns(kobj->sd);
+		sysfs_enable_ns(kobj->sd, ops->type);
 	}
 
 	return 0;
-- 
2.26.1



* [PATCH v2 5/7] loop: preserve sysfs backwards compatibility
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
                   ` (3 preceding siblings ...)
  2020-04-22 14:54 ` [PATCH v2 4/7] kernfs: handle multiple namespace tags Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-23  1:17   ` Serge E. Hallyn
  2020-04-22 14:54 ` [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
  2020-04-22 14:54 ` [PATCH v2 7/7] loopfs: only show devices in their correct instance Christian Brauner
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

For sysfs the initial namespace is special. All devices currently
propagate into all non-initial namespaces. For example, sysfs is usually
mounted in a privileged or unprivileged container and all devices are
visible to the container but are owned by global root. Even though none
of the propagated files can be used, there are still a lot of read-only
values that are accessed or read by tools running in non-initial
namespaces. Some devices though, which can be moved or created in
another namespace, will only show up in the corresponding namespace.
This currently includes network and loop devices but no other ones.
Since all current workloads depend on devices from the initial namespace
being visible, this behavior cannot simply be changed. This patch just
makes sure to keep propagating devices that share the same device class
with loop devices from the initial namespace into all non-initial
namespaces as before. In short, nothing changes: only loopfs loop devices
will be shown in their correct namespace.
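
The visibility rule described above can be sketched in userspace C. This is
only an illustration, not the kernfs code: node_visible() and init_ns_tag
are hypothetical names, and namespace tags are modelled as opaque pointers
that are compared by address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag for the initial namespace; tags are opaque pointers. */
static int init_ns_tag;

/*
 * A node is visible in a mount when its tag matches the mount's tag, or
 * when propagation is enabled for the parent directory and the node is
 * tagged with the initial namespace.
 */
bool node_visible(const void *node_ns, const void *mount_ns,
		  bool init_ns_propagates)
{
	/* Children of a namespace-enabled directory carry a non-NULL tag. */
	assert(node_ns != NULL);

	if (node_ns == mount_ns)
		return true;

	return init_ns_propagates && node_ns == (const void *)&init_ns_tag;
}
```

With propagation enabled, a node tagged with the initial namespace is
visible from a mount tagged with any other namespace, while a node tagged
with a non-initial namespace is only visible from a mount carrying that
same tag.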

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Protect init_net with a CONFIG_NET ifdef in case it is set to "n".
  - As Tejun pointed out there is an argument to be made that a new mount
    option for sysfs could be added that would change how devices are
    propagated. This patch does not prevent this but it is an orthogonal
    problem.
---
 block/genhd.c               | 79 +++++++++++++++++++++++++++++++++++++
 fs/kernfs/dir.c             | 34 +++++++++++++---
 fs/kernfs/kernfs-internal.h | 24 +++++++++++
 fs/sysfs/mount.c            |  4 ++
 include/linux/genhd.h       |  3 ++
 include/linux/kernfs.h      | 22 +++++++++++
 include/linux/kobject_ns.h  |  4 ++
 lib/kobject.c               |  2 +
 8 files changed, 167 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 06b642b23a07..b5b2601c4311 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1198,11 +1198,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
 	return NULL;
 }
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+static void *user_grab_current_ns(void)
+{
+	struct user_namespace *ns = current_user_ns();
+	return get_user_ns(ns);
+}
+
+static const void *user_initial_ns(void)
+{
+	return &init_user_ns;
+}
+
+static void user_put_ns(void *p)
+{
+	struct user_namespace *ns = p;
+	put_user_ns(ns);
+}
+
+static bool user_current_may_mount(void)
+{
+	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
+}
+
+const struct kobj_ns_type_operations user_ns_type_operations = {
+	.type			= KOBJ_NS_TYPE_USER,
+	.current_may_mount	= user_current_may_mount,
+	.grab_current_ns	= user_grab_current_ns,
+	.initial_ns		= user_initial_ns,
+	.drop_ns		= user_put_ns,
+};
+
+static const void *block_class_user_namespace(struct device *dev)
+{
+	struct gendisk *disk;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	return disk->user_ns;
+}
+
+static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
+{
+	struct gendisk *disk;
+	struct user_namespace *ns;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	ns = disk->user_ns;
+	if (ns && ns != &init_user_ns) {
+		kuid_t ns_root_uid = make_kuid(ns, 0);
+		kgid_t ns_root_gid = make_kgid(ns, 0);
+
+		if (uid_valid(ns_root_uid))
+			*uid = ns_root_uid;
+
+		if (gid_valid(ns_root_gid))
+			*gid = ns_root_gid;
+	}
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int __init genhd_device_init(void)
 {
 	int error;
 
 	block_class.dev_kobj = sysfs_dev_block_kobj;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	kobj_ns_type_register(&user_ns_type_operations);
+#endif
 	error = class_register(&block_class);
 	if (unlikely(error))
 		return error;
@@ -1524,8 +1594,14 @@ static void disk_release(struct device *dev)
 		blk_put_queue(disk->queue);
 	kfree(disk);
 }
+
 struct class block_class = {
 	.name		= "block",
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	.ns_type	= &user_ns_type_operations,
+	.namespace	= block_class_user_namespace,
+	.get_ownership	= block_class_get_ownership,
+#endif
 };
 
 static char *block_devnode(struct device *dev, umode_t *mode,
@@ -1715,6 +1791,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 		disk_to_dev(disk)->class = &block_class;
 		disk_to_dev(disk)->type = &disk_type;
 		device_initialize(disk_to_dev(disk));
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		disk->user_ns = &init_user_ns;
+#endif
 	}
 	return disk;
 }
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1f2d894ae454..02796ba6521a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
-		goto out_bad;
+	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
+		if (kernfs_init_ns_propagates(kn->parent) &&
+		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
+			goto out_good;
+		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
+			goto out_bad;
+	}
 
+out_good:
 	mutex_unlock(&kernfs_mutex);
 	return 1;
 out_bad:
@@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	if (!kn && kernfs_init_ns_propagates(parent)) {
+		ns = kernfs_init_ns(parent->ns_type);
+		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	}
 
 	/* no such entry */
 	if (!kn || !kernfs_active(kn)) {
@@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+	const void *init_ns;
+
 	if (pos) {
 		int valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
@@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		if (!valid)
 			pos = NULL;
 	}
+
+	if (kernfs_init_ns_propagates(parent))
+		init_ns = kernfs_init_ns(parent->ns_type);
+	else
+		init_ns = NULL;
+
 	if (!pos && (hash > 1) && (hash < INT_MAX)) {
 		struct rb_node *node = parent->dir.children.rb_node;
 		while (node) {
@@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 {
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
 	if (pos) {
+		const void *init_ns;
+		if (kernfs_init_ns_propagates(parent))
+			init_ns = kernfs_init_ns(parent->ns_type);
+		else
+			init_ns = NULL;
+
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
 			if (!node)
 				pos = NULL;
 			else
 				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
+		} while (pos && (!kernfs_active(pos) ||
+				 (pos->ns != ns && pos->ns != init_ns)));
 	}
 	return pos;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7c972c00f84a..74eb6c447361 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -80,6 +80,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 	return d_inode(dentry)->i_private;
 }
 
+#ifdef CONFIG_NET
+extern struct net init_net;
+#endif
+
+extern struct user_namespace init_user_ns;
+
+static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
+{
+	switch (ns_type) {
+	case KOBJ_NS_TYPE_NET:
+#ifdef CONFIG_NET
+		return &init_net;
+#else
+		break;
+#endif
+	case KOBJ_NS_TYPE_USER:
+		return &init_user_ns;
+	default:
+		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
+	}
+
+	return NULL;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 5e2ec88a709e..99b82a0ae7ea 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 
 	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
+		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -85,6 +88,7 @@ static void sysfs_kill_sb(struct super_block *sb)
 
 	kernfs_kill_sb(sb);
 	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
+	kobj_ns_drop(KOBJ_NS_TYPE_USER, ns[KOBJ_NS_TYPE_USER]);
 }
 
 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 9b3fffdf4011..4edf8a82e815 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -220,6 +220,9 @@ struct gendisk {
 	int node_id;
 	struct badblocks *bb;
 	struct lockdep_map lockdep_map;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct user_namespace *user_ns;
+#endif
 };
 
 static inline struct gendisk *part_to_disk(struct hd_struct *part)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d0544f2e0c99..7809584a7c8e 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -53,6 +53,7 @@ enum kernfs_node_flag {
 	KERNFS_SUICIDED		= 0x0800,
 	KERNFS_EMPTY_DIR	= 0x1000,
 	KERNFS_HAS_RELEASE	= 0x2000,
+	KERNFS_NS_PROPAGATE	= 0x4000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -337,6 +338,27 @@ static inline void kernfs_enable_ns(struct kernfs_node *kn,
 	kn->ns_type = ns_type;
 }
 
+static inline void kernfs_enable_init_ns_propagates(struct kernfs_node *kn)
+{
+	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
+	WARN_ON_ONCE(!(kn->flags & KERNFS_NS));
+	kn->flags |= KERNFS_NS_PROPAGATE;
+}
+
+/**
+ * kernfs_init_ns_propagates - test whether init ns propagates
+ * @kn: the node to test
+ *
+ * Test whether kernfs entries created in the init namespace propagate into
+ * other namespaces.
+ */
+static inline bool kernfs_init_ns_propagates(const struct kernfs_node *kn)
+{
+	return ((kn->flags & (KERNFS_NS | KERNFS_NS_PROPAGATE)) ==
+		(KERNFS_NS | KERNFS_NS_PROPAGATE));
+}
+
 /**
  * kernfs_ns_enabled - test whether namespace is enabled
  * @kn: the node to test
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 991a9286bcea..a9c45bcce235 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -26,6 +26,7 @@ struct kobject;
 enum kobj_ns_type {
 	KOBJ_NS_TYPE_NONE = 0,
 	KOBJ_NS_TYPE_NET,
+	KOBJ_NS_TYPE_USER,
 	KOBJ_NS_TYPES
 };
 
@@ -34,6 +35,8 @@ enum kobj_ns_type {
  *   @grab_current_ns: return a new reference to calling task's namespace
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
+ *   @initial_ns_propagates: whether devices in the initial namespace propagate
+ *			to all other namespaces
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
@@ -41,6 +44,7 @@ struct kobj_ns_type_operations {
 	void *(*grab_current_ns)(void);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
+	bool (*initial_ns_propagates)(void);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
diff --git a/lib/kobject.c b/lib/kobject.c
index c58c62d49a10..96bb8c732d1c 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -121,6 +121,8 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
 		sysfs_enable_ns(kobj->sd, ops->type);
+		if (ops->initial_ns_propagates && ops->initial_ns_propagates())
+			kernfs_enable_init_ns_propagates(kobj->sd);
 	}
 
 	return 0;
-- 
2.26.1



* [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add()
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
                   ` (4 preceding siblings ...)
  2020-04-22 14:54 ` [PATCH v2 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-23  1:36   ` Serge E. Hallyn
  2020-04-22 14:54 ` [PATCH v2 7/7] loopfs: only show devices in their correct instance Christian Brauner
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Tag loop devices with the namespace the loopfs instance was mounted in.
This has the consequence that loopfs devices carry the correct sysfs
permissions for all their core files. All other device files will
continue to be correctly owned by the initial namespace. Here is sample
output:

root@b1:~# mount -t loop loop /mnt
root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
root@b1:~# losetup -f
/dev/loop8
root@b1:~# ln -sf /mnt/loop8 /dev/loop8
root@b1:~# ls -al /sys/class/block/loop8
lrwxrwxrwx 1 root root 0 Apr  7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
root@b1:~# ls -al /sys/class/block/loop8/
total 0
drwxr-xr-x  9 root   root       0 Apr  7 13:06 .
drwxr-xr-x 18 nobody nogroup    0 Apr  7 13:07 ..
-r--r--r--  1 root   root    4096 Apr  7 13:06 alignment_offset
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 bdi -> ../../bdi/7:8
-r--r--r--  1 root   root    4096 Apr  7 13:06 capability
-r--r--r--  1 root   root    4096 Apr  7 13:06 dev
-r--r--r--  1 root   root    4096 Apr  7 13:06 discard_alignment
-r--r--r--  1 root   root    4096 Apr  7 13:06 events
-r--r--r--  1 root   root    4096 Apr  7 13:06 events_async
-rw-r--r--  1 root   root    4096 Apr  7 13:06 events_poll_msecs
-r--r--r--  1 root   root    4096 Apr  7 13:06 ext_range
-r--r--r--  1 root   root    4096 Apr  7 13:06 hidden
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 holders
-r--r--r--  1 root   root    4096 Apr  7 13:06 inflight
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 integrity
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 mq
drwxr-xr-x  2 root   root       0 Apr  7 13:06 power
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 queue
-r--r--r--  1 root   root    4096 Apr  7 13:06 range
-r--r--r--  1 root   root    4096 Apr  7 13:06 removable
-r--r--r--  1 root   root    4096 Apr  7 13:06 ro
-r--r--r--  1 root   root    4096 Apr  7 13:06 size
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 slaves
-r--r--r--  1 root   root    4096 Apr  7 13:06 stat
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 subsystem -> ../../../../class/block
drwxr-xr-x  2 root   root       0 Apr  7 13:06 trace
-rw-r--r--  1 root   root    4096 Apr  7 13:06 uevent
root@b1:~#
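
The ownership seen in the listing follows from mapping uid 0 of the owning
user namespace to a host id. A minimal userspace sketch of that idea,
assuming a single-extent id map; struct id_extent, map_id_down() and
get_ownership() are hypothetical names for illustration and are not kernel
APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent of a hypothetical id map: namespace ids in
 * [first, first + count) map to host ids starting at lower_first. */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Translate a namespace id to a host id, or INVALID_ID if unmapped. */
uint32_t map_id_down(const struct id_extent *e, uint32_t id)
{
	assert(e != NULL);

	if (id < e->first || id - e->first >= e->count)
		return INVALID_ID;

	return e->lower_first + (id - e->first);
}

/*
 * Override the owner only when the namespace has a map and its root
 * (id 0) translates to a valid host id; otherwise keep the default.
 */
void get_ownership(const struct id_extent *ns_map, uint32_t *uid)
{
	if (ns_map) {
		uint32_t root = map_id_down(ns_map, 0);

		if (root != INVALID_ID)
			*uid = root;
	}
}
```

With a map of {0, 100000, 65536}, a default owner of 0 becomes 100000, which
a task inside that namespace sees as root. Entries that keep the default
owner have no mapping inside the container and show up as nobody/nogroup.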

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Adapted commit message; otherwise unchanged.
---
 drivers/block/loop.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8e21d4b33e01..2dc53bad4b48 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2212,6 +2212,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (loopfs_i_sb(inode))
+		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+#endif
 
 	add_disk(disk);
 
-- 
2.26.1



* [PATCH v2 7/7] loopfs: only show devices in their correct instance
  2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
                   ` (5 preceding siblings ...)
  2020-04-22 14:54 ` [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
@ 2020-04-22 14:54 ` Christian Brauner
  2020-04-23  1:37   ` Serge E. Hallyn
  6 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-22 14:54 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Since loopfs devices belong to a loopfs instance they have no business
polluting the host's devtmpfs mount and should not propagate out of the
namespace they belong to.
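
As a rough userspace sketch of the check this patch adds, assuming a
per-device opt-out flag: struct sketch_device and wants_devtmpfs_node() are
hypothetical names, and only the condition itself is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct device. */
struct sketch_device {
	const char *name;
	bool no_devnode;	/* node lives in a separate filesystem */
};

/*
 * A devtmpfs node is only requested when devtmpfs is running and the
 * device has not opted out of having one.
 */
bool wants_devtmpfs_node(const struct sketch_device *dev,
			 bool devtmpfs_running)
{
	assert(dev != NULL);

	return devtmpfs_running && !dev->no_devnode;
}
```

A loop device created through a loopfs instance would set the flag and so
never show up in the host's devtmpfs mount, while a regular loop device
leaves it unset and behaves as before.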

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged
---
 drivers/base/devtmpfs.c | 4 ++--
 drivers/block/loop.c    | 4 +++-
 include/linux/device.h  | 3 +++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..77371ceb88fa 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -111,7 +111,7 @@ int devtmpfs_create_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.mode = 0;
@@ -138,7 +138,7 @@ int devtmpfs_delete_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2dc53bad4b48..5548151b9f11 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2213,8 +2213,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
 #ifdef CONFIG_BLK_DEV_LOOPFS
-	if (loopfs_i_sb(inode))
+	if (loopfs_i_sb(inode)) {
 		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+		disk_to_dev(disk)->no_devnode = true;
+	}
 #endif
 
 	add_disk(disk);
diff --git a/include/linux/device.h b/include/linux/device.h
index ac8e37cd716a..c69ef1c5a0ef 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -523,6 +523,8 @@ struct dev_links_info {
  *		  sync_state() callback.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @no_devnode: whether device nodes associated with this device are kept out
+ *		of devtmpfs (e.g. due to separate filesystem)
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -622,6 +624,7 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+	bool			no_devnode:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
-- 
2.26.1



* Re: [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check
  2020-04-22 14:54 ` [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
@ 2020-04-22 16:34   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 20+ messages in thread
From: Greg Kroah-Hartman @ 2020-04-22 16:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, linux-kernel, linux-block, linux-api,
	Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev,
	Steve Barber, Dylan Reid, Filipe Brandenburger, Kees Cook,
	Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:31PM +0200, Christian Brauner wrote:
> Back when I rewrote large chunks of uevent sending I should have removed
> the .netlink_ns method completely after having removed its last user in
> [1]. Let's remove it now and also remove the helper associated with it
> that is unused too.
> 
> Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [PATCH v2 3/7] loop: use ns_capable for some loop operations
  2020-04-22 14:54 ` [PATCH v2 3/7] loop: use ns_capable for some loop operations Christian Brauner
@ 2020-04-22 20:50   ` Serge E. Hallyn
  0 siblings, 0 replies; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-22 20:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:33PM +0200, Christian Brauner wrote:
> The LOOP_GET_STATUS, LOOP_SET_STATUS, and LOOP_SET_BLOCK_SIZE
> operations are now allowed in non-initial namespaces. Most other
> operations were already possible before.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Seth Forshee <seth.forshee@canonical.com>
> Cc: Tom Gundersen <teg@jklm.no>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christian Kellner <ckellner@redhat.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Rheinsberg <david.rheinsberg@gmail.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Adapted loop_capable() based on changes in the loopfs
>     implementation patchset. Otherwise it is functionally equivalent to
>     the v1 version.
> ---
>  drivers/block/loop.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 52f7583dd17d..8e21d4b33e01 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1352,6 +1352,16 @@ void loopfs_evict_locked(struct loop_device *lo)
>  	}
>  	mutex_unlock(&loop_ctl_mutex);
>  }
> +
> +static bool loop_capable(const struct loop_device *lo, int cap)
> +{
> +	return ns_capable(loopfs_ns(lo), cap);
> +}
> +#else /* !CONFIG_BLK_DEV_LOOPFS */
> +static inline bool loop_capable(const struct loop_device *lo, int cap)
> +{
> +	return capable(cap);
> +}
>  #endif /* CONFIG_BLK_DEV_LOOPFS */
>  
>  static int
> @@ -1368,7 +1378,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>  		return err;
>  	if (lo->lo_encrypt_key_size &&
>  	    !uid_eq(lo->lo_key_owner, uid) &&
> -	    !capable(CAP_SYS_ADMIN)) {
> +	    !loop_capable(lo, CAP_SYS_ADMIN)) {
>  		err = -EPERM;
>  		goto out_unlock;
>  	}
> @@ -1499,7 +1509,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
>  	info->lo_encrypt_type =
>  		lo->lo_encryption ? lo->lo_encryption->number : 0;
> -	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
> +	if (lo->lo_encrypt_key_size && loop_capable(lo, CAP_SYS_ADMIN)) {
>  		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
>  		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
>  		       lo->lo_encrypt_key_size);
> @@ -1723,7 +1733,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		return loop_clr_fd(lo);
>  	case LOOP_SET_STATUS:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
> +		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
>  			err = loop_set_status_old(lo,
>  					(struct loop_info __user *)arg);
>  		}
> @@ -1732,7 +1742,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		return loop_get_status_old(lo, (struct loop_info __user *) arg);
>  	case LOOP_SET_STATUS64:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
> +		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
>  			err = loop_set_status64(lo,
>  					(struct loop_info64 __user *) arg);
>  		}
> @@ -1742,7 +1752,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  	case LOOP_SET_CAPACITY:
>  	case LOOP_SET_DIRECT_IO:
>  	case LOOP_SET_BLOCK_SIZE:
> -		if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
> +		if (!(mode & FMODE_WRITE) && !loop_capable(lo, CAP_SYS_ADMIN))
>  			return -EPERM;
>  		/* Fall through */
>  	default:
> -- 
> 2.26.1

* Re: [PATCH v2 2/7] loopfs: implement loopfs
  2020-04-22 14:54 ` [PATCH v2 2/7] loopfs: implement loopfs Christian Brauner
@ 2020-04-22 21:52   ` Serge E. Hallyn
  2020-04-23 11:24     ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-22 21:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:32PM +0200, Christian Brauner wrote:
> This implements loopfs, a loop device filesystem. It takes inspiration
> from the binderfs filesystem I implemented about two years ago and with
> which we had overall good experiences so far. Parts of it are also
> based on [3] but it's mostly a new, imho cleaner approach.
> 
> Loopfs allows creating private loop device instances for applications
> for various use-cases. It covers the use-case that was expressed on-list
> and in-person to get programmatic access to private loop devices for
> image building in sandboxes. An illustration for this is provided in
> [4].
> 
> Also loopfs is intended to provide loop devices to privileged and
> unprivileged containers which has been a frequent request from various
> major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
> providing a non-exhaustive list of issues and requests (cf. [5]) around
> this feature mainly to illustrate that I'm not making the use-cases up.
> Currently none of this can be done safely since handing a loop device
> from the host into a container means that the container can see anything
> that the host is doing with that loop device and what other containers
> are doing with that device too. And (bind-)mounting devtmpfs inside of
> containers is not secure at all so also not an option (though sometimes
> done out of despair apparently).
> 
> The workloads people run in containers are supposed to be indiscernible
> from workloads run on the host and the tools inside of the container are
> supposed to not be required to be aware that they are running inside a
> container apart from containerization tools themselves. This is
> especially true when running older distros in containers that existed
> before containers were as ubiquitous as they are today. With loopfs, a
> user can call mount -o loop and in a correctly set up container things
> work the same way they would on the host. The filesystem representation
> allows us to do this in a very simple way. At container setup, a
> container manager can mount a private instance of loopfs somewhere, e.g.
> at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
> to /dev/loop-control, pre-allocate and symlink the number of standard
> devices into their standard location and have a service file or rules in
> place that symlink additionally allocated loop devices through losetup
> into place as well.
> With the new syscall interception logic this is also possible for
> unprivileged containers. In these cases when a user calls mount -o loop
> <image> <mountpoint> it will be possible to completely set up the loop
> device in the container. The final mount syscall is handled through
> syscall interception which we already implemented and released in
> earlier kernels (see [1] and [2]) and is actively used in production
> workloads. The mount is often rewritten to a fuse binary to provide safe
> access for unprivileged containers.
> 
> Loopfs also allows the creation of hidden/detached dynamic loop devices
> and associated mounts which was also an often-issued request. With the
> old mount api this can be achieved by creating a temporary loopfs and
> stashing a file descriptor to the mount point and the loop-control
> device and immediately unmounting the loopfs instance.  With the new
> mount api a detached mount can be created directly (i.e. a mount not
> visible anywhere in the filesystem). New loop devices can then be
> allocated and configured. They can be mounted through
> /proc/self/<fd>/<nr> with the old mount api or by using the fd directly
> with the new mount api. Combined with a mount namespace this allows for
> fully auto-cleaned up loop devices on program crash. This ties back to
> various use-cases and is illustrated in [4].
> 
> The filesystem representation requires the standard boilerplate
> filesystem code we know from other tiny filesystems. And all of
> the loopfs code is hidden under a config option that defaults to false.
> This specifically means that none of the code even exists when users do
> not have any use-case for loopfs.
> In addition, the loopfs code does not alter how loop devices behave at
> all, i.e. there are no changes to any existing workloads and I've taken
> care to ifdef all loopfs specific things out.
> 
> Each loopfs mount is a separate instance. As such loop devices created
> in one instance are independent of loop devices created in another
> instance. This specifically entails that loop devices are only visible
> in the loopfs instance they belong to.
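
The visibility rule in the paragraph above can be sketched as a toy
userspace model. This is only an illustration under stated assumptions:
the names model_loop_device, model_state and model_find_free are
hypothetical, and the integer "instance" merely stands in for the
super block of a loopfs mount. It loosely follows the shape of
find_free_cb() in the patch further down (skip devices that are not
unbound, then skip devices that belong to another instance).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_state { MODEL_UNBOUND, MODEL_BOUND };

struct model_loop_device {
	int number;             /* models lo->lo_number */
	enum model_state state; /* models lo->lo_state */
	int instance;           /* stand-in for the owning loopfs super block */
};

/*
 * Return the first unbound device that belongs to @instance, or NULL if
 * that instance has no free device. Devices of other instances are never
 * returned, even when they are unbound.
 */
static struct model_loop_device *
model_find_free(struct model_loop_device *devs, size_t n, int instance)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (devs[i].state != MODEL_UNBOUND)
			continue;
		if (devs[i].instance != instance)
			continue;
		return &devs[i];
	}
	return NULL;
}
```

In this model a caller holding instance 1 never gets a device that was
allocated through instance 2, which is the property the paragraph
describes.
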
> 
> The number of loop devices available in loopfs instances is
> hierarchically limited through /proc/sys/user/max_loop_devices via the
> ucount infrastructure (Thanks to David Rheinsberg for pointing out that
> missing piece.). An administrator could e.g. set
> echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
> instance mounted by uid x can only create 3 loop devices no matter how
> many loopfs instances they mount. This limit applies hierarchically to
> all user namespaces.

Hm, info->device_count is per loopfs mount, though, right?  I don't
see where this gets incremented for all of a user's loopfs mounts
when one adds a loopdev?

I'm sure I'm missing something obvious...
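
For what it's worth, the two limits in question can be told apart with
a small userspace model. This is only a sketch under stated
assumptions: model_user, model_mount, model_add_device and
model_remove_device are made-up names that exist nowhere in the kernel.
It assumes, as loopfs_add() in the quoted patch suggests, that each new
device is checked against the per-mount device_count/max pair and is
separately charged to a per-user counter (inc_ucount() with
UCOUNT_LOOP_DEVICES). Hierarchical propagation across user namespaces
is left out entirely.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_user {
	unsigned long loop_devices;     /* devices charged to this user */
	unsigned long max_loop_devices; /* models /proc/sys/user/max_loop_devices */
};

struct model_mount {
	struct model_user *owner;   /* user who mounted this instance */
	unsigned long device_count; /* models info->device_count */
	unsigned long max;          /* models the "max" mount option */
};

/* Returns 0 on success or -ENOSPC when either limit would be exceeded. */
static int model_add_device(struct model_mount *m)
{
	/* Per-mount limit: only counts devices of this instance. */
	if (m->device_count + 1 > m->max)
		return -ENOSPC;

	/* Per-user limit: shared by every mount owned by the same user. */
	if (m->owner->loop_devices + 1 > m->owner->max_loop_devices)
		return -ENOSPC;

	m->owner->loop_devices++;
	m->device_count++;
	return 0;
}

/* Drops one device from the mount and uncharges it from the owner. */
static void model_remove_device(struct model_mount *m)
{
	if (m->device_count == 0)
		return;

	/* The owner is always charged for at least this mount's devices. */
	assert(m->owner->loop_devices > 0);
	m->device_count--;
	m->owner->loop_devices--;
}
```

With max_loop_devices set to 3 and two mounts owned by the same user,
the fourth model_add_device() call fails with -ENOSPC regardless of
which mount it goes through, while each mount's device_count only
reflects its own devices.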

> In addition, loopfs has a "max" mount option which allows to set a limit
> on the number of loop devices for a given loopfs instance. This is
> mainly to cover use-cases where a single loopfs mount is shared as a
> bind-mount between multiple parties that are prevented from creating
> other loopfs mounts and is equivalent to the semantics of the binderfs
> and devpts "max" mount option.
> 
> Note that in __loop_clr_fd() we now need not just check whether bdev is
> valid but also whether bdev->bd_disk is valid. This wasn't necessary
> before because in order to call LOOP_CLR_FD the loop device would need
> to be open and thus bdev->bd_disk was guaranteed to be allocated. For
> loopfs loop devices we allow callers to simply unlink them just as we do
> for binderfs binder devices and we do also need to account for the case
> where a loopfs superblock is shutdown while backing files might still be
> associated with some loop devices. In such cases no bd_disk device will
> be attached to bdev. This is not in itself noteworthy; it's more about
> documenting the "why" of the added bdev->bd_disk check for posterity.
> 
> [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
> [4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
> [5]: https://github.com/kubernetes-sigs/kind/issues/1333
>      https://github.com/kubernetes-sigs/kind/issues/1248
>      https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
>      https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
>      https://gitlab.com/gitlab-com/support-forum/issues/3732
>      https://github.com/moby/moby/issues/27886
>      https://twitter.com/_AkihiroSuda_/status/1249664478267854848
>      https://serverfault.com/questions/701384/loop-device-in-a-linux-container
>      https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
>      https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Steve Barber <smbarber@google.com>
> Cc: Filipe Brandenburger <filbranden@gmail.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Benjamin Elder <bentheelder@google.com>
> Cc: Seth Forshee <seth.forshee@canonical.com>
> Cc: Stéphane Graber <stgraber@ubuntu.com>
> Cc: Tom Gundersen <teg@jklm.no>
> Cc: Serge Hallyn <serge@hallyn.com>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christian Kellner <ckellner@redhat.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dylan Reid <dgreid@google.com>
> Cc: David Rheinsberg <david.rheinsberg@gmail.com>
> Cc: Akihiro Suda <suda.kyoto@gmail.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> - David Rheinsberg <david.rheinsberg@gmail.com> /
>   Christian Brauner <christian.brauner@ubuntu.com>:
>   - Correctly cleanup loop devices that are in-use after the loopfs
>     instance has been shut down. This is important for some use-cases
>     that David pointed out where they effectively create a loopfs
>     instance, allocate devices and drop unnecessary references to it.
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Replace lo_loopfs_i inode member in struct loop_device with a custom
>     struct lo_info pointer which is only allocated for loopfs loop
>     devices.
> ---
>  MAINTAINERS                    |   5 +
>  drivers/block/Kconfig          |   4 +
>  drivers/block/Makefile         |   1 +
>  drivers/block/loop.c           | 200 ++++++++++---
>  drivers/block/loop.h           |  12 +-
>  drivers/block/loopfs/Makefile  |   3 +
>  drivers/block/loopfs/loopfs.c  | 494 +++++++++++++++++++++++++++++++++
>  drivers/block/loopfs/loopfs.h  |  36 +++
>  include/linux/user_namespace.h |   3 +
>  include/uapi/linux/magic.h     |   1 +
>  kernel/ucount.c                |   3 +
>  11 files changed, 721 insertions(+), 41 deletions(-)
>  create mode 100644 drivers/block/loopfs/Makefile
>  create mode 100644 drivers/block/loopfs/loopfs.c
>  create mode 100644 drivers/block/loopfs/loopfs.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b816a453b10e..560b37a65bce 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9957,6 +9957,11 @@ W:	http://www.avagotech.com/support/
>  F:	drivers/message/fusion/
>  F:	drivers/scsi/mpt3sas/
>  
> +LOOPFS FILE SYSTEM
> +M:	Christian Brauner <christian.brauner@ubuntu.com>
> +S:	Supported
> +F:	drivers/block/loopfs/
> +
>  LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
>  M:	Matthew Wilcox <willy@infradead.org>
>  L:	linux-scsi@vger.kernel.org
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 025b1b77b11a..d7ff37d795ad 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -214,6 +214,10 @@ config BLK_DEV_LOOP
>  
>  	  Most users will answer N here.
>  
> +config BLK_DEV_LOOPFS
> +	bool "Loopback device virtual filesystem support"
> +	depends on BLK_DEV_LOOP=y
> +
>  config BLK_DEV_LOOP_MIN_COUNT
>  	int "Number of loop devices to pre-create at init time"
>  	depends on BLK_DEV_LOOP
> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> index 795facd8cf19..7052be26aa8b 100644
> --- a/drivers/block/Makefile
> +++ b/drivers/block/Makefile
> @@ -36,6 +36,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
>  obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
>  obj-$(CONFIG_BLK_DEV_RBD)     += rbd.o
>  obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
> +obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs/
>  
>  obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
>  obj-$(CONFIG_ZRAM) += zram/
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da693e6a834e..52f7583dd17d 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -81,6 +81,10 @@
>  
>  #include "loop.h"
>  
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +#include "loopfs/loopfs.h"
> +#endif
> +
>  #include <linux/uaccess.h>
>  
>  static DEFINE_IDR(loop_index_idr);
> @@ -1115,6 +1119,24 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
>  	return err;
>  }
>  
> +static void loop_remove(struct loop_device *lo)
> +{
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	loopfs_remove(lo);
> +#endif
> +	del_gendisk(lo->lo_disk);
> +	blk_cleanup_queue(lo->lo_queue);
> +	blk_mq_free_tag_set(&lo->tag_set);
> +	put_disk(lo->lo_disk);
> +	kfree(lo);
> +}
> +
> +static inline void __loop_remove(struct loop_device *lo)
> +{
> +	idr_remove(&loop_index_idr, lo->lo_number);
> +	loop_remove(lo);
> +}
> +
>  static int __loop_clr_fd(struct loop_device *lo, bool release)
>  {
>  	struct file *filp = NULL;
> @@ -1164,7 +1186,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	}
>  	set_capacity(lo->lo_disk, 0);
>  	loop_sysfs_exit(lo);
> -	if (bdev) {
> +	if (bdev && bdev->bd_disk) {
>  		bd_set_size(bdev, 0);
>  		/* let user-space know about this change */
>  		kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
> @@ -1174,7 +1196,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	module_put(THIS_MODULE);
>  	blk_mq_unfreeze_queue(lo->lo_queue);
>  
> -	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
> +	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk;
>  	lo_number = lo->lo_number;
>  	loop_unprepare_queue(lo);
>  out_unlock:
> @@ -1213,7 +1235,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>  	lo->lo_flags = 0;
>  	if (!part_shift)
>  		lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
> -	lo->lo_state = Lo_unbound;
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	if (loopfs_wants_remove(lo))
> +		__loop_remove(lo);
> +	else
> +#endif
> +		lo->lo_state = Lo_unbound;
>  	mutex_unlock(&loop_ctl_mutex);
>  
>  	/*
> @@ -1259,6 +1286,74 @@ static int loop_clr_fd(struct loop_device *lo)
>  	return __loop_clr_fd(lo, false);
>  }
>  
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +int loopfs_rundown_locked(struct loop_device *lo)
> +{
> +	int ret;
> +
> +	if (WARN_ON_ONCE(!loopfs_device(lo)))
> +		return -EINVAL;
> +
> +	ret = mutex_lock_killable(&loop_ctl_mutex);
> +	if (ret)
> +		return ret;
> +
> +	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) {
> +		ret = -EBUSY;
> +	} else {
> +		/*
> +		 * Since the device is unbound it has no associated backing
> +		 * file and we can safely set Lo_rundown to prevent it from
> +		 * being found. Actual cleanup happens during inode eviction.
> +		 */
> +		lo->lo_state = Lo_rundown;
> +		ret = 0;
> +	}
> +
> +	mutex_unlock(&loop_ctl_mutex);
> +	return ret;
> +}
> +
> +/**
> + * loopfs_evict_locked() - remove loop device or mark inactive
> + * @lo:	loopfs loop device
> + *
> + * This function will remove a loop device. If it has no users
> + * and is bound the backing file will be cleaned up. If the loop
> + * device has users it will be marked for auto cleanup.
> + * This function is only called when a loopfs instance is shutdown
> + * when all references to it from this loopfs instance have been
> + * dropped. If there are still any references to it cleanup will
> + * happen in lo_release().
> + */
> +void loopfs_evict_locked(struct loop_device *lo)
> +{
> +	struct lo_loopfs *lo_info;
> +	struct inode *lo_inode;
> +
> +	WARN_ON_ONCE(!loopfs_device(lo));
> +
> +	mutex_lock(&loop_ctl_mutex);
> +	lo_info = lo->lo_info;
> +	lo_inode = lo_info->lo_inode;
> +	lo_info->lo_inode = NULL;
> +	lo_info->lo_flags |= LOOPFS_FLAGS_INACTIVE;
> +
> +	if (atomic_read(&lo->lo_refcnt) > 0) {
> +		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
> +	} else {
> +		lo->lo_state = Lo_rundown;
> +		lo->lo_disk->private_data = NULL;
> +		lo_inode->i_private = NULL;
> +
> +		mutex_unlock(&loop_ctl_mutex);
> +		__loop_clr_fd(lo, false);
> +		return;
> +	}
> +	mutex_unlock(&loop_ctl_mutex);
> +}
> +#endif /* CONFIG_BLK_DEV_LOOPFS */
> +
>  static int
>  loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>  {
> @@ -1842,7 +1937,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
>  
>  	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
>  		if (lo->lo_state != Lo_bound)
> -			goto out_unlock;
> +			goto out_remove;
>  		lo->lo_state = Lo_rundown;
>  		mutex_unlock(&loop_ctl_mutex);
>  		/*
> @@ -1860,6 +1955,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
>  		blk_mq_unfreeze_queue(lo->lo_queue);
>  	}
>  
> +out_remove:
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	if (lo->lo_state != Lo_bound && loopfs_wants_remove(lo))
> +		__loop_remove(lo);
> +#endif
> +
>  out_unlock:
>  	mutex_unlock(&loop_ctl_mutex);
>  }
> @@ -1878,6 +1979,11 @@ static const struct block_device_operations lo_fops = {
>   * And now the modules code and kernel interface.
>   */
>  static int max_loop;
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +unsigned long max_devices;
> +#else
> +static unsigned long max_devices;
> +#endif
>  module_param(max_loop, int, 0444);
>  MODULE_PARM_DESC(max_loop, "Maximum number of loop devices");
>  module_param(max_part, int, 0444);
> @@ -2006,7 +2112,7 @@ static const struct blk_mq_ops loop_mq_ops = {
>  	.complete	= lo_complete_rq,
>  };
>  
> -static int loop_add(struct loop_device **l, int i)
> +static int loop_add(struct loop_device **l, int i, struct inode *inode)
>  {
>  	struct loop_device *lo;
>  	struct gendisk *disk;
> @@ -2096,7 +2202,17 @@ static int loop_add(struct loop_device **l, int i)
>  	disk->private_data	= lo;
>  	disk->queue		= lo->lo_queue;
>  	sprintf(disk->disk_name, "loop%d", i);
> +
>  	add_disk(disk);
> +
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	err = loopfs_add(lo, inode, disk_devt(disk));
> +	if (err) {
> +		__loop_remove(lo);
> +		goto out;
> +	}
> +#endif
> +
>  	*l = lo;
>  	return lo->lo_number;
>  
> @@ -2112,36 +2228,41 @@ static int loop_add(struct loop_device **l, int i)
>  	return err;
>  }
>  
> -static void loop_remove(struct loop_device *lo)
> -{
> -	del_gendisk(lo->lo_disk);
> -	blk_cleanup_queue(lo->lo_queue);
> -	blk_mq_free_tag_set(&lo->tag_set);
> -	put_disk(lo->lo_disk);
> -	kfree(lo);
> -}
> +struct find_free_cb_data {
> +	struct loop_device **l;
> +	struct inode *inode;
> +};
>  
>  static int find_free_cb(int id, void *ptr, void *data)
>  {
>  	struct loop_device *lo = ptr;
> -	struct loop_device **l = data;
> +	struct find_free_cb_data *cb_data = data;
>  
> -	if (lo->lo_state == Lo_unbound) {
> -		*l = lo;
> -		return 1;
> -	}
> -	return 0;
> +	if (lo->lo_state != Lo_unbound)
> +		return 0;
> +
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	if (!loopfs_access(cb_data->inode, lo))
> +		return 0;
> +#endif
> +
> +	*cb_data->l = lo;
> +	return 1;
>  }
>  
> -static int loop_lookup(struct loop_device **l, int i)
> +static int loop_lookup(struct loop_device **l, int i, struct inode *inode)
>  {
>  	struct loop_device *lo;
>  	int ret = -ENODEV;
>  
>  	if (i < 0) {
>  		int err;
> +		struct find_free_cb_data cb_data = {
> +			.l = &lo,
> +			.inode = inode,
> +		};
>  
> -		err = idr_for_each(&loop_index_idr, &find_free_cb, &lo);
> +		err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data);
>  		if (err == 1) {
>  			*l = lo;
>  			ret = lo->lo_number;
> @@ -2152,6 +2273,11 @@ static int loop_lookup(struct loop_device **l, int i)
>  	/* lookup and return a specific i */
>  	lo = idr_find(&loop_index_idr, i);
>  	if (lo) {
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +		if (!loopfs_access(inode, lo))
> +			return -EACCES;
> +#endif
> +
>  		*l = lo;
>  		ret = lo->lo_number;
>  	}
> @@ -2166,9 +2292,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
>  	int err;
>  
>  	mutex_lock(&loop_ctl_mutex);
> -	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
> +	err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL);
>  	if (err < 0)
> -		err = loop_add(&lo, MINOR(dev) >> part_shift);
> +		err = loop_add(&lo, MINOR(dev) >> part_shift, NULL);
>  	if (err < 0)
>  		kobj = NULL;
>  	else
> @@ -2192,15 +2318,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
>  	ret = -ENOSYS;
>  	switch (cmd) {
>  	case LOOP_CTL_ADD:
> -		ret = loop_lookup(&lo, parm);
> +		ret = loop_lookup(&lo, parm, file_inode(file));
>  		if (ret >= 0) {
>  			ret = -EEXIST;
>  			break;
>  		}
> -		ret = loop_add(&lo, parm);
> +		ret = loop_add(&lo, parm, file_inode(file));
>  		break;
>  	case LOOP_CTL_REMOVE:
> -		ret = loop_lookup(&lo, parm);
> +		ret = loop_lookup(&lo, parm, file_inode(file));
>  		if (ret < 0)
>  			break;
>  		if (lo->lo_state != Lo_unbound) {
> @@ -2212,14 +2338,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
>  			break;
>  		}
>  		lo->lo_disk->private_data = NULL;
> -		idr_remove(&loop_index_idr, lo->lo_number);
> -		loop_remove(lo);
> +		__loop_remove(lo);
>  		break;
>  	case LOOP_CTL_GET_FREE:
> -		ret = loop_lookup(&lo, -1);
> +		ret = loop_lookup(&lo, -1, file_inode(file));
>  		if (ret >= 0)
>  			break;
> -		ret = loop_add(&lo, -1);
> +		ret = loop_add(&lo, -1, file_inode(file));
>  	}
>  	mutex_unlock(&loop_ctl_mutex);
>  
> @@ -2246,7 +2371,6 @@ MODULE_ALIAS("devname:loop-control");
>  static int __init loop_init(void)
>  {
>  	int i, nr;
> -	unsigned long range;
>  	struct loop_device *lo;
>  	int err;
>  
> @@ -2285,10 +2409,10 @@ static int __init loop_init(void)
>  	 */
>  	if (max_loop) {
>  		nr = max_loop;
> -		range = max_loop << part_shift;
> +		max_devices = max_loop << part_shift;
>  	} else {
>  		nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT;
> -		range = 1UL << MINORBITS;
> +		max_devices = 1UL << MINORBITS;
>  	}
>  
>  	err = misc_register(&loop_misc);
> @@ -2301,13 +2425,13 @@ static int __init loop_init(void)
>  		goto misc_out;
>  	}
>  
> -	blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
> +	blk_register_region(MKDEV(LOOP_MAJOR, 0), max_devices,
>  				  THIS_MODULE, loop_probe, NULL, NULL);
>  
>  	/* pre-create number of devices given by config or max_loop */
>  	mutex_lock(&loop_ctl_mutex);
>  	for (i = 0; i < nr; i++)
> -		loop_add(&lo, i);
> +		loop_add(&lo, i, NULL);
>  	mutex_unlock(&loop_ctl_mutex);
>  
>  	printk(KERN_INFO "loop: module loaded\n");
> @@ -2329,14 +2453,10 @@ static int loop_exit_cb(int id, void *ptr, void *data)
>  
>  static void __exit loop_exit(void)
>  {
> -	unsigned long range;
> -
> -	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
> -
>  	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
>  	idr_destroy(&loop_index_idr);
>  
> -	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
> +	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), max_devices);
>  	unregister_blkdev(LOOP_MAJOR, "loop");
>  
>  	misc_deregister(&loop_misc);
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index af75a5ee4094..6fed746b6124 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -17,6 +17,10 @@
>  #include <linux/kthread.h>
>  #include <uapi/linux/loop.h>
>  
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +#include "loopfs/loopfs.h"
> +#endif
> +
>  /* Possible states of device */
>  enum {
>  	Lo_unbound,
> @@ -62,6 +66,9 @@ struct loop_device {
>  	struct request_queue	*lo_queue;
>  	struct blk_mq_tag_set	tag_set;
>  	struct gendisk		*lo_disk;
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	struct lo_loopfs	*lo_info;
> +#endif
>  };
>  
>  struct loop_cmd {
> @@ -89,6 +96,9 @@ struct loop_func_table {
>  }; 
>  
>  int loop_register_transfer(struct loop_func_table *funcs);
> -int loop_unregister_transfer(int number); 
> +int loop_unregister_transfer(int number);
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +extern unsigned long max_devices;
> +#endif
>  
>  #endif
> diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile
> new file mode 100644
> index 000000000000..87ec703b662e
> --- /dev/null
> +++ b/drivers/block/loopfs/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +loopfs-y			:= loopfs.o
> +obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs.o
> diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
> new file mode 100644
> index 000000000000..b3461c72b6e7
> --- /dev/null
> +++ b/drivers/block/loopfs/loopfs.c
> @@ -0,0 +1,494 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/fs.h>
> +#include <linux/fs_parser.h>
> +#include <linux/fsnotify.h>
> +#include <linux/genhd.h>
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/magic.h>
> +#include <linux/major.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +
> +#include "../loop.h"
> +#include "loopfs.h"
> +
> +#define FIRST_INODE 1
> +#define SECOND_INODE 2
> +#define INODE_OFFSET 3
> +
> +enum loopfs_param {
> +	Opt_max,
> +};
> +
> +const struct fs_parameter_spec loopfs_fs_parameters[] = {
> +	fsparam_u32("max",	Opt_max),
> +	{}
> +};
> +
> +struct loopfs_mount_opts {
> +	int max;
> +};
> +
> +struct loopfs_info {
> +	kuid_t root_uid;
> +	kgid_t root_gid;
> +	unsigned long device_count;
> +	struct dentry *control_dentry;
> +	struct loopfs_mount_opts mount_opts;
> +};
> +
> +static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb)
> +{
> +	return sb->s_fs_info;
> +}
> +
> +struct super_block *loopfs_i_sb(const struct inode *inode)
> +{
> +	if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC)
> +		return inode->i_sb;
> +
> +	return NULL;
> +}
> +
> +bool loopfs_device(const struct loop_device *lo)
> +{
> +	return lo->lo_info != NULL;
> +}
> +
> +struct user_namespace *loopfs_ns(const struct loop_device *lo)
> +{
> +	if (loopfs_device(lo)) {
> +		struct super_block *sb;
> +
> +		sb = loopfs_i_sb(lo->lo_info->lo_inode);
> +		if (sb)
> +			return sb->s_user_ns;
> +	}
> +
> +	return &init_user_ns;
> +}
> +
> +bool loopfs_access(const struct inode *first, struct loop_device *lo)
> +{
> +	return loopfs_device(lo) &&
> +	       loopfs_i_sb(first) == loopfs_i_sb(lo->lo_info->lo_inode);
> +}
> +
> +bool loopfs_wants_remove(const struct loop_device *lo)
> +{
> +	return lo->lo_info && (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE);
> +}
> +
> +/**
> + * loopfs_add - allocate inode from super block of a loopfs mount
> + * @lo:		loop device for which we are creating a new device entry
> + * @ref_inode:	inode from which the super block will be taken
> + * @device_nr:  device number of the associated disk device
> + *
> + * This function creates a new device node for @lo.
> + * Minor numbers are limited and tracked globally. The
> + * function will stash a struct loop_device for the specific loop
> + * device in i_private of the inode.
> + * It will go on to allocate a new inode from the super block of the
> + * filesystem mount, stash a struct loop_device in its i_private field
> + * and attach a dentry to that inode.
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +int loopfs_add(struct loop_device *lo, struct inode *ref_inode, dev_t device_nr)
> +{
> +	int ret;
> +	char name[DISK_NAME_LEN];
> +	struct super_block *sb;
> +	struct loopfs_info *info;
> +	struct dentry *root, *dentry;
> +	struct inode *inode;
> +	struct lo_loopfs *lo_info;
> +
> +	sb = loopfs_i_sb(ref_inode);
> +	if (!sb)
> +		return 0;
> +
> +	if (MAJOR(device_nr) != LOOP_MAJOR)
> +		return -EINVAL;
> +
> +	lo_info = kzalloc(sizeof(struct lo_loopfs), GFP_KERNEL);
> +	if (!lo_info) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	info = LOOPFS_SB(sb);
> +	if ((info->device_count + 1) > info->mount_opts.max) {
> +		ret = -ENOSPC;
> +		goto err;
> +	}
> +
> +	lo_info->lo_ucount = inc_ucount(sb->s_user_ns,
> +					info->root_uid, UCOUNT_LOOP_DEVICES);
> +	if (!lo_info->lo_ucount) {
> +		ret = -ENOSPC;
> +		goto err;
> +	}
> +
> +	if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	inode = new_inode(sb);
> +	if (!inode) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	/*
> +	 * The i_fop field will be set to the correct fops by the device layer
> +	 * when the loop device in this loopfs instance is opened.
> +	 */
> +	inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
> +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> +	inode->i_uid = info->root_uid;
> +	inode->i_gid = info->root_gid;
> +	init_special_inode(inode, S_IFBLK | 0600, device_nr);
> +
> +	root = sb->s_root;
> +	inode_lock(d_inode(root));
> +	/* look it up */
> +	dentry = lookup_one_len(name, root, strlen(name));
> +	if (IS_ERR(dentry)) {
> +		inode_unlock(d_inode(root));
> +		iput(inode);
> +		ret = PTR_ERR(dentry);
> +		goto err;
> +	}
> +
> +	if (d_really_is_positive(dentry)) {
> +		/* already exists */
> +		dput(dentry);
> +		inode_unlock(d_inode(root));
> +		iput(inode);
> +		ret = -EEXIST;
> +		goto err;
> +	}
> +
> +	d_instantiate(dentry, inode);
> +	fsnotify_create(d_inode(root), dentry);
> +	inode_unlock(d_inode(root));
> +
> +	lo_info->lo_inode = inode;
> +	lo->lo_info = lo_info;
> +	inode->i_private = lo;
> +	info->device_count++;
> +
> +	return 0;
> +
> +err:
> +	if (lo_info && lo_info->lo_ucount)
> +		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
> +	kfree(lo_info);
> +	return ret;
> +}
> +
> +void loopfs_remove(struct loop_device *lo)
> +{
> +	struct lo_loopfs *lo_info = lo->lo_info;
> +	struct inode *inode;
> +	struct super_block *sb;
> +	struct dentry *root, *dentry;
> +
> +	if (!lo_info)
> +		return;
> +
> +	inode = lo_info->lo_inode;
> +	if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR)
> +		goto out;
> +
> +	sb = loopfs_i_sb(inode);
> +	lo_info->lo_inode = NULL;
> +
> +	/*
> +	 * The root dentry is always the parent dentry since we don't allow
> +	 * creation of directories.
> +	 */
> +	root = sb->s_root;
> +
> +	inode_lock(d_inode(root));
> +	dentry = d_find_any_alias(inode);
> +	if (dentry && simple_positive(dentry)) {
> +		simple_unlink(d_inode(root), dentry);
> +		d_delete(dentry);
> +	}
> +	dput(dentry);
> +	inode_unlock(d_inode(root));
> +	LOOPFS_SB(sb)->device_count--;
> +
> +out:
> +	if (lo_info->lo_ucount)
> +		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
> +	kfree(lo->lo_info);
> +	lo->lo_info = NULL;
> +}
> +
> +static void loopfs_fs_context_free(struct fs_context *fc)
> +{
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +
> +	kfree(ctx);
> +}
> +
> +/**
> + * loopfs_loop_ctl_create - create a new loop-control device
> + * @sb: super block of the loopfs mount
> + *
> + * This function creates a new loop-control device node in the loopfs mount
> + * referred to by @sb.
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +static int loopfs_loop_ctl_create(struct super_block *sb)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode = NULL;
> +	struct dentry *root = sb->s_root;
> +	struct loopfs_info *info = sb->s_fs_info;
> +
> +	if (info->control_dentry)
> +		return 0;
> +
> +	inode = new_inode(sb);
> +	if (!inode)
> +		return -ENOMEM;
> +
> +	inode->i_ino = SECOND_INODE;
> +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> +	init_special_inode(inode, S_IFCHR | 0600,
> +			   MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR));
> +	/*
> +	 * The i_fop field will be set to the correct fops by the device layer
> +	 * when the loop-control device in this loopfs instance is opened.
> +	 */
> +	inode->i_uid = info->root_uid;
> +	inode->i_gid = info->root_gid;
> +
> +	dentry = d_alloc_name(root, "loop-control");
> +	if (!dentry) {
> +		iput(inode);
> +		return -ENOMEM;
> +	}
> +
> +	info->control_dentry = dentry;
> +	d_add(dentry, inode);
> +
> +	return 0;
> +}
> +
> +static inline bool is_loopfs_control_device(const struct dentry *dentry)
> +{
> +	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
> +}
> +
> +static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> +			 struct inode *new_dir, struct dentry *new_dentry,
> +			 unsigned int flags)
> +{
> +	if (is_loopfs_control_device(old_dentry) ||
> +	    is_loopfs_control_device(new_dentry))
> +		return -EPERM;
> +
> +	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
> +}
> +
> +static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +	int ret;
> +	struct loop_device *lo;
> +
> +	if (is_loopfs_control_device(dentry))
> +		return -EPERM;
> +
> +	lo = d_inode(dentry)->i_private;
> +	ret = loopfs_rundown_locked(lo);
> +	if (ret)
> +		return ret;
> +
> +	return simple_unlink(dir, dentry);
> +}
> +
> +static const struct inode_operations loopfs_dir_inode_operations = {
> +	.lookup = simple_lookup,
> +	.rename = loopfs_rename,
> +	.unlink = loopfs_unlink,
> +};
> +
> +static void loopfs_evict_inode(struct inode *inode)
> +{
> +	struct loop_device *lo = inode->i_private;
> +
> +	clear_inode(inode);
> +
> +	if (lo && S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
> +		loopfs_evict_locked(lo);
> +		LOOPFS_SB(inode->i_sb)->device_count--;
> +		inode->i_private = NULL;
> +	}
> +}
> +
> +static int loopfs_show_options(struct seq_file *seq, struct dentry *root)
> +{
> +	struct loopfs_info *info = LOOPFS_SB(root->d_sb);
> +
> +	if (info->mount_opts.max <= max_devices)
> +		seq_printf(seq, ",max=%d", info->mount_opts.max);
> +
> +	return 0;
> +}
> +
> +static void loopfs_put_super(struct super_block *sb)
> +{
> +	struct loopfs_info *info = sb->s_fs_info;
> +
> +	sb->s_fs_info = NULL;
> +	kfree(info);
> +}
> +
> +static const struct super_operations loopfs_super_ops = {
> +	.evict_inode    = loopfs_evict_inode,
> +	.show_options	= loopfs_show_options,
> +	.statfs         = simple_statfs,
> +	.put_super	= loopfs_put_super,
> +};
> +
> +static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	struct loopfs_info *info;
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct inode *inode = NULL;
> +
> +	sb->s_blocksize = PAGE_SIZE;
> +	sb->s_blocksize_bits = PAGE_SHIFT;
> +
> +	sb->s_iflags &= ~SB_I_NODEV;
> +	sb->s_iflags |= SB_I_NOEXEC;
> +	sb->s_magic = LOOPFS_SUPER_MAGIC;
> +	sb->s_op = &loopfs_super_ops;
> +	sb->s_time_gran = 1;
> +
> +	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
> +	if (!sb->s_fs_info)
> +		return -ENOMEM;
> +	info = sb->s_fs_info;
> +
> +	info->root_gid = make_kgid(sb->s_user_ns, 0);
> +	if (!gid_valid(info->root_gid))
> +		info->root_gid = GLOBAL_ROOT_GID;
> +	info->root_uid = make_kuid(sb->s_user_ns, 0);
> +	if (!uid_valid(info->root_uid))
> +		info->root_uid = GLOBAL_ROOT_UID;
> +	info->mount_opts.max = ctx->max;
> +
> +	inode = new_inode(sb);
> +	if (!inode)
> +		return -ENOMEM;
> +
> +	inode->i_ino = FIRST_INODE;
> +	inode->i_fop = &simple_dir_operations;
> +	inode->i_mode = S_IFDIR | 0755;
> +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> +	inode->i_op = &loopfs_dir_inode_operations;
> +	set_nlink(inode, 2);
> +
> +	sb->s_root = d_make_root(inode);
> +	if (!sb->s_root)
> +		return -ENOMEM;
> +
> +	return loopfs_loop_ctl_create(sb);
> +}
> +
> +static int loopfs_fs_context_get_tree(struct fs_context *fc)
> +{
> +	return get_tree_nodev(fc, loopfs_fill_super);
> +}
> +
> +static int loopfs_fs_context_parse_param(struct fs_context *fc,
> +					 struct fs_parameter *param)
> +{
> +	int opt;
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct fs_parse_result result;
> +
> +	opt = fs_parse(fc, loopfs_fs_parameters, param, &result);
> +	if (opt < 0)
> +		return opt;
> +
> +	switch (opt) {
> +	case Opt_max:
> +		if (result.uint_32 > max_devices)
> +			return invalfc(fc, "Bad value for '%s'", param->key);
> +
> +		ctx->max = result.uint_32;
> +		break;
> +	default:
> +		return invalfc(fc, "Unsupported parameter '%s'", param->key);
> +	}
> +
> +	return 0;
> +}
> +
> +static int loopfs_fs_context_reconfigure(struct fs_context *fc)
> +{
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +	struct loopfs_info *info = LOOPFS_SB(fc->root->d_sb);
> +
> +	info->mount_opts.max = ctx->max;
> +	return 0;
> +}
> +
> +static const struct fs_context_operations loopfs_fs_context_ops = {
> +	.free		= loopfs_fs_context_free,
> +	.get_tree	= loopfs_fs_context_get_tree,
> +	.parse_param	= loopfs_fs_context_parse_param,
> +	.reconfigure	= loopfs_fs_context_reconfigure,
> +};
> +
> +static int loopfs_init_fs_context(struct fs_context *fc)
> +{
> +	struct loopfs_mount_opts *ctx = fc->fs_private;
> +
> +	ctx = kzalloc(sizeof(struct loopfs_mount_opts), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	ctx->max = max_devices;
> +
> +	fc->fs_private = ctx;
> +
> +	fc->ops = &loopfs_fs_context_ops;
> +
> +	return 0;
> +}
> +
> +static struct file_system_type loop_fs_type = {
> +	.name			= "loop",
> +	.init_fs_context	= loopfs_init_fs_context,
> +	.parameters		= loopfs_fs_parameters,
> +	.kill_sb		= kill_litter_super,
> +	.fs_flags		= FS_USERNS_MOUNT,
> +};
> +
> +int __init init_loopfs(void)
> +{
> +	init_user_ns.ucount_max[UCOUNT_LOOP_DEVICES] = 255;
> +	return register_filesystem(&loop_fs_type);
> +}
> +
> +module_init(init_loopfs);
> +MODULE_AUTHOR("Christian Brauner <christian.brauner@ubuntu.com>");
> +MODULE_DESCRIPTION("Loop device filesystem");
> diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
> new file mode 100644
> index 000000000000..2ee114aa3fa9
> --- /dev/null
> +++ b/drivers/block/loopfs/loopfs.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_LOOPFS_FS_H
> +#define _LINUX_LOOPFS_FS_H
> +
> +#include <linux/errno.h>
> +#include <linux/fs.h>
> +#include <linux/magic.h>
> +#include <linux/user_namespace.h>
> +
> +struct loop_device;
> +
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +
> +#define LOOPFS_FLAGS_INACTIVE (1 << 0)
> +
> +struct lo_loopfs {
> +	struct ucounts *lo_ucount;
> +	struct inode *lo_inode;
> +	int lo_flags;
> +};
> +
> +extern struct super_block *loopfs_i_sb(const struct inode *inode);
> +extern bool loopfs_device(const struct loop_device *lo);
> +extern struct user_namespace *loopfs_ns(const struct loop_device *lo);
> +extern bool loopfs_access(const struct inode *first, struct loop_device *lo);
> +extern int loopfs_add(struct loop_device *lo, struct inode *ref_inode,
> +		      dev_t device_nr);
> +extern void loopfs_remove(struct loop_device *lo);
> +extern bool loopfs_wants_remove(const struct loop_device *lo);
> +extern void loopfs_evict_locked(struct loop_device *lo);
> +extern int loopfs_rundown_locked(struct loop_device *lo);
> +
> +#endif
> +
> +#endif /* _LINUX_LOOPFS_FS_H */
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..04a4891765c0 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -49,6 +49,9 @@ enum ucount_type {
>  #ifdef CONFIG_INOTIFY_USER
>  	UCOUNT_INOTIFY_INSTANCES,
>  	UCOUNT_INOTIFY_WATCHES,
> +#endif
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	UCOUNT_LOOP_DEVICES,
>  #endif
>  	UCOUNT_COUNTS,
>  };
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index d78064007b17..0817d093a012 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -75,6 +75,7 @@
>  #define BINFMTFS_MAGIC          0x42494e4d
>  #define DEVPTS_SUPER_MAGIC	0x1cd1
>  #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
> +#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
>  #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
>  #define PIPEFS_MAGIC            0x50495045
>  #define PROC_SUPER_MAGIC	0x9fa0
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 11b1596e2542..fb0f6394a8bb 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -73,6 +73,9 @@ static struct ctl_table user_table[] = {
>  #ifdef CONFIG_INOTIFY_USER
>  	UCOUNT_ENTRY("max_inotify_instances"),
>  	UCOUNT_ENTRY("max_inotify_watches"),
> +#endif
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	UCOUNT_ENTRY("max_loop_devices"),
>  #endif
>  	{ }
>  };
> -- 
> 2.26.1


* Re: [PATCH v2 4/7] kernfs: handle multiple namespace tags
  2020-04-22 14:54 ` [PATCH v2 4/7] kernfs: handle multiple namespace tags Christian Brauner
@ 2020-04-22 22:01   ` Serge E. Hallyn
  0 siblings, 0 replies; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-22 22:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:34PM +0200, Christian Brauner wrote:
> Since [1] kernfs supports namespace tags. This feature is essential to
> enable sysfs to present different views of various parts depending on
> the namespace tag. For example, the /sys/class/net/ directory will only
> show network devices that belong to the network namespace that sysfs was
> mounted in. This is achieved by stashing a reference to the network
> namespace of the task mounting sysfs in the super block. And when a
> lookup operation is performed on e.g. /sys/class/net/ kernfs will
> compare the network namespace tag of the kernfs_node associated with the
> device and kobject of the network device to the network namespace of the
> network device. This ensures that only network devices owned by the
> network namespace sysfs was mounted in are shown, a feature which is
> essential to containers.
> For loopfs to show correct permissions in sysfs just as with network
> devices we need to be able to tag kernfs_super_info with additional
> namespaces. This extension was even already mentioned in a comment to
> struct kernfs_super_info:
>   /*
>    * Each sb is associated with one namespace tag, currently the
>    * network namespace of the task which mounted this kernfs
>    * instance.  If multiple tags become necessary, make the following
>    * an array and compare kernfs_node tag against every entry.
>    */
> This patch extends the kernfs_super_info and kernfs_fs_context ns
> pointers to fixed-size arrays of namespace tags. The size is taken from
> the namespaces currently supported by kobjects, i.e. we don't extend it
> to cover all namespaces but only the ones kernfs needs to support.
> In addition, the kernfs_node struct gains an additional member that
> indicates the type of namespace this kernfs_node was tagged with. This
> allows us to simply retrieve the correct namespace tag from the
> kernfs_fs_context and kernfs_super_info ns array with a simple indexing
> operation. This has the advantage that we can just keep passing down the
> correct namespace instead of passing down the array.
> 
> [1]: 608b4b9548de ("netns: Teach network device kobjects which namespace they are in.")
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Acked-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Serge Hallyn <serge@hallyn.com>
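
For readers following the commit message, the indexing scheme it describes can be sketched in plain userspace C. This is only an illustration under stated assumptions: `enum ns_type`, `struct sb_info`, `struct node` and `node_visible()` are hypothetical stand-ins, not the kernfs types or helpers touched by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kobject namespace types. */
enum ns_type { NS_TYPE_NONE, NS_TYPE_NET, NS_TYPE_USER, NS_TYPES };

/* Simplified super block info: one tag slot per namespace type. */
struct sb_info {
	const void *ns[NS_TYPES];
};

/* Simplified node: its tag plus the type naming the slot to compare with. */
struct node {
	const void *ns;
	enum ns_type ns_type;
};

/*
 * A node is visible when its tag equals the super block's tag of the
 * same type. The type is a plain array index, so no search is needed.
 */
int node_visible(const struct sb_info *sb, const struct node *n)
{
	assert(n->ns_type < NS_TYPES);
	return sb->ns[n->ns_type] == n->ns;
}
```

The design point is that callers keep passing a single tag around, and only the comparison site needs to know which slot of the array is relevant.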

> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> unchanged
> ---
>  fs/kernfs/dir.c             |  6 +++---
>  fs/kernfs/kernfs-internal.h |  9 ++++-----
>  fs/kernfs/mount.c           | 11 +++++++----
>  fs/sysfs/mount.c            | 10 +++++-----
>  include/linux/kernfs.h      | 22 ++++++++++++++--------
>  include/linux/sysfs.h       |  8 +++++---
>  lib/kobject.c               |  2 +-
>  7 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 9aec80b9d7c6..1f2d894ae454 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -576,7 +576,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>  
>  	/* The kernfs node has been moved to a different namespace */
>  	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
> -	    kernfs_info(dentry->d_sb)->ns != kn->ns)
> +	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
>  		goto out_bad;
>  
>  	mutex_unlock(&kernfs_mutex);
> @@ -1087,7 +1087,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
>  	mutex_lock(&kernfs_mutex);
>  
>  	if (kernfs_ns_enabled(parent))
> -		ns = kernfs_info(dir->i_sb)->ns;
> +		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
>  
>  	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
>  
> @@ -1673,7 +1673,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>  	mutex_lock(&kernfs_mutex);
>  
>  	if (kernfs_ns_enabled(parent))
> -		ns = kernfs_info(dentry->d_sb)->ns;
> +		ns = kernfs_info(dentry->d_sb)->ns[parent->ns_type];
>  
>  	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
>  	     pos;
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 7ee97ef59184..7c972c00f84a 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -16,6 +16,7 @@
>  #include <linux/xattr.h>
>  
>  #include <linux/kernfs.h>
> +#include <linux/kobject_ns.h>
>  #include <linux/fs_context.h>
>  
>  struct kernfs_iattrs {
> @@ -62,12 +63,10 @@ struct kernfs_super_info {
>  	struct kernfs_root	*root;
>  
>  	/*
> -	 * Each sb is associated with one namespace tag, currently the
> -	 * network namespace of the task which mounted this kernfs
> -	 * instance.  If multiple tags become necessary, make the following
> -	 * an array and compare kernfs_node tag against every entry.
> +	 * Each sb can be associated with namespace tags. They will be used
> +	 * to compare kernfs_node tags against relevant entries.
>  	 */
> -	const void		*ns;
> +	const void		*ns[KOBJ_NS_TYPES];
>  
>  	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
>  	struct list_head	node;
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 9dc7e7a64e10..dc4ee0f0a597 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -279,14 +279,15 @@ static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
>  	struct kernfs_super_info *sb_info = kernfs_info(sb);
>  	struct kernfs_super_info *info = fc->s_fs_info;
>  
> -	return sb_info->root == info->root && sb_info->ns == info->ns;
> +	return sb_info->root == info->root &&
> +	       memcmp(sb_info->ns, info->ns, sizeof(sb_info->ns)) == 0;
>  }
>  
>  static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
>  {
>  	struct kernfs_fs_context *kfc = fc->fs_private;
>  
> -	kfc->ns_tag = NULL;
> +	memset(kfc->ns_tag, 0, sizeof(kfc->ns_tag));
>  	return set_anon_super_fc(sb, fc);
>  }
>  
> @@ -296,7 +297,7 @@ static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
>   *
>   * Return the namespace tag associated with kernfs super_block @sb.
>   */
> -const void *kernfs_super_ns(struct super_block *sb)
> +const void **kernfs_super_ns(struct super_block *sb)
>  {
>  	struct kernfs_super_info *info = kernfs_info(sb);
>  
> @@ -324,7 +325,9 @@ int kernfs_get_tree(struct fs_context *fc)
>  		return -ENOMEM;
>  
>  	info->root = kfc->root;
> -	info->ns = kfc->ns_tag;
> +	BUILD_BUG_ON(sizeof(info->ns) != sizeof(kfc->ns_tag));
> +	memcpy(info->ns, kfc->ns_tag, sizeof(info->ns));
> +
>  	INIT_LIST_HEAD(&info->node);
>  
>  	fc->s_fs_info = info;
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index db81cfbab9d6..5e2ec88a709e 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -41,8 +41,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
>  {
>  	struct kernfs_fs_context *kfc = fc->fs_private;
>  
> -	if (kfc->ns_tag)
> -		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
> +	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
> +		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
>  	kernfs_free_fs_context(fc);
>  	kfree(kfc);
>  }
> @@ -66,7 +66,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
>  	if (!kfc)
>  		return -ENOMEM;
>  
> -	kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> +	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
>  	kfc->root = sysfs_root;
>  	kfc->magic = SYSFS_MAGIC;
>  	fc->fs_private = kfc;
> @@ -81,10 +81,10 @@ static int sysfs_init_fs_context(struct fs_context *fc)
>  
>  static void sysfs_kill_sb(struct super_block *sb)
>  {
> -	void *ns = (void *)kernfs_super_ns(sb);
> +	void **ns = (void **)kernfs_super_ns(sb);
>  
>  	kernfs_kill_sb(sb);
> -	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
> +	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
>  }
>  
>  static struct file_system_type sysfs_fs_type = {
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 89f6a4214a70..d0544f2e0c99 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -16,6 +16,7 @@
>  #include <linux/atomic.h>
>  #include <linux/uidgid.h>
>  #include <linux/wait.h>
> +#include <linux/kobject_ns.h>
>  
>  struct file;
>  struct dentry;
> @@ -137,8 +138,9 @@ struct kernfs_node {
>  
>  	struct rb_node		rb;
>  
> -	const void		*ns;	/* namespace tag */
> -	unsigned int		hash;	/* ns + name hash */
> +	const void		*ns;		/* namespace tag */
> +	enum kobj_ns_type	ns_type;	/* type of namespace tag */
> +	unsigned int		hash;		/* ns + name hash */
>  	union {
>  		struct kernfs_elem_dir		dir;
>  		struct kernfs_elem_symlink	symlink;
> @@ -275,7 +277,7 @@ struct kernfs_ops {
>   */
>  struct kernfs_fs_context {
>  	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
> -	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
> +	void			*ns_tag[KOBJ_NS_TYPES]; /* Namespace tags of the mount (or empty) */
>  	unsigned long		magic;		/* File system specific magic number */
>  
>  	/* The following are set/used by kernfs_mount() */
> @@ -319,17 +321,20 @@ static inline ino_t kernfs_gen(struct kernfs_node *kn)
>  
>  /**
>   * kernfs_enable_ns - enable namespace under a directory
> - * @kn: directory of interest, should be empty
> + * @kn:		directory of interest, should be empty
> + * @ns_type:	type of namespace that should be enabled for this directory
>   *
>   * This is to be called right after @kn is created to enable namespace
>   * under it.  All children of @kn must have non-NULL namespace tags and
>   * only the ones which match the super_block's tag will be visible.
>   */
> -static inline void kernfs_enable_ns(struct kernfs_node *kn)
> +static inline void kernfs_enable_ns(struct kernfs_node *kn,
> +				    enum kobj_ns_type ns_type)
>  {
>  	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
>  	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
>  	kn->flags |= KERNFS_NS;
> +	kn->ns_type = ns_type;
>  }
>  
>  /**
> @@ -401,7 +406,7 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
>  int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
>  		     const void *value, size_t size, int flags);
>  
> -const void *kernfs_super_ns(struct super_block *sb);
> +const void **kernfs_super_ns(struct super_block *sb);
>  int kernfs_get_tree(struct fs_context *fc);
>  void kernfs_free_fs_context(struct fs_context *fc);
>  void kernfs_kill_sb(struct super_block *sb);
> @@ -415,7 +420,8 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
>  static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
>  { return 0; }	/* whatever */
>  
> -static inline void kernfs_enable_ns(struct kernfs_node *kn) { }
> +static inline void kernfs_enable_ns(struct kernfs_node *kn,
> +				    enum kobj_ns_type ns_type) { }
>  
>  static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
>  { return false; }
> @@ -511,7 +517,7 @@ static inline int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
>  				   const void *value, size_t size, int flags)
>  { return -ENOSYS; }
>  
> -static inline const void *kernfs_super_ns(struct super_block *sb)
> +static inline const void **kernfs_super_ns(struct super_block *sb)
>  { return NULL; }
>  
>  static inline int kernfs_get_tree(struct fs_context *fc)
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index 80bb865b3a33..d127b3487abc 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -306,9 +306,10 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
>  
>  int __must_check sysfs_init(void);
>  
> -static inline void sysfs_enable_ns(struct kernfs_node *kn)
> +static inline void sysfs_enable_ns(struct kernfs_node *kn,
> +				   enum kobj_ns_type ns_type)
>  {
> -	return kernfs_enable_ns(kn);
> +	return kernfs_enable_ns(kn, ns_type);
>  }
>  
>  int sysfs_file_change_owner(struct kobject *kobj, const char *name, kuid_t kuid,
> @@ -531,7 +532,8 @@ static inline int __must_check sysfs_init(void)
>  	return 0;
>  }
>  
> -static inline void sysfs_enable_ns(struct kernfs_node *kn)
> +static inline void sysfs_enable_ns(struct kernfs_node *kn,
> +				   enum kobj_ns_type ns_type)
>  {
>  }
>  
> diff --git a/lib/kobject.c b/lib/kobject.c
> index 6f07083cc111..c58c62d49a10 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -120,7 +120,7 @@ static int create_dir(struct kobject *kobj)
>  		BUG_ON(ops->type >= KOBJ_NS_TYPES);
>  		BUG_ON(!kobj_ns_type_registered(ops->type));
>  
> -		sysfs_enable_ns(kobj->sd);
> +		sysfs_enable_ns(kobj->sd, ops->type);
>  	}
>  
>  	return 0;
> -- 
> 2.26.1


* Re: [PATCH v2 5/7] loop: preserve sysfs backwards compatibility
  2020-04-22 14:54 ` [PATCH v2 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
@ 2020-04-23  1:17   ` Serge E. Hallyn
  2020-04-23 11:15     ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-23  1:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:35PM +0200, Christian Brauner wrote:
> For sysfs the initial namespace is special. All devices currently
> propagate into all non-initial namespaces. For example, sysfs is usually
> mounted in a privileged or unprivileged container and all devices are
> visible to the container but are owned by global root. Even though none
> of the propagated files can be used there are still a lot of read-only
> values that are accessed or read by tools running in non-initial
> namespaces. Some devices though, which can be moved or created in
> another namespace, will only show up in the corresponding namespace.
> This currently includes network and loop devices but no other ones.
> Since all current workloads depend on devices from the initial namespace
> being visible this behavior cannot be simply changed. This patch just
> makes sure to keep propagating devices that share the same device class
> with loop devices from the initial namespaces into all non-initial
> namespaces as before. In short, nothing changes; only loopfs loop devices
> will be shown in their correct namespace.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
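
The propagation rule described in the commit message (look in the mount's own namespace first, then fall back to the initial namespace if the directory propagates) can be sketched in userspace C as follows. This is a rough illustration, not the kernfs code: `struct entry`, `find_ns()` and `lookup()` are hypothetical names, and a flat array stands in for the rbtree of children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical directory entry: a name tagged with an opaque namespace. */
struct entry {
	const char *name;
	const void *ns;
};

/* Find an entry that matches both the name and the namespace tag. */
const struct entry *find_ns(const struct entry *dir, size_t n,
			    const char *name, const void *ns)
{
	for (size_t i = 0; i < n; i++)
		if (dir[i].ns == ns && strcmp(dir[i].name, name) == 0)
			return &dir[i];
	return NULL;
}

/*
 * Look in the mount's namespace first. If nothing is found and the
 * directory propagates, retry with the initial namespace tag.
 */
const struct entry *lookup(const struct entry *dir, size_t n,
			   const char *name, const void *mount_ns,
			   const void *initial_ns, int propagates)
{
	const struct entry *e = find_ns(dir, n, name, mount_ns);

	if (!e && propagates)
		e = find_ns(dir, n, name, initial_ns);
	return e;
}
```

With this shape, entries owned by a third namespace stay hidden, while entries from the initial namespace remain visible everywhere, which is the backwards-compatible behavior the patch aims to keep.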

Hi,

two comments below:

> ---
> /* v2 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Protect init_net with a CONFIG_NET ifdef in case it is set to "n".
>   - As Tejun pointed out there is an argument to be made that a new mount
>     option for sysfs could be added that would change how devices are
>     propagated. This patch does not prevent this but it is an orthogonal
>     problem.
> ---
>  block/genhd.c               | 79 +++++++++++++++++++++++++++++++++++++
>  fs/kernfs/dir.c             | 34 +++++++++++++---
>  fs/kernfs/kernfs-internal.h | 24 +++++++++++
>  fs/sysfs/mount.c            |  4 ++
>  include/linux/genhd.h       |  3 ++
>  include/linux/kernfs.h      | 22 +++++++++++
>  include/linux/kobject_ns.h  |  4 ++
>  lib/kobject.c               |  2 +
>  8 files changed, 167 insertions(+), 5 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 06b642b23a07..b5b2601c4311 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1198,11 +1198,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +static void *user_grab_current_ns(void)
> +{
> +	struct user_namespace *ns = current_user_ns();
> +	return get_user_ns(ns);
> +}
> +
> +static const void *user_initial_ns(void)
> +{
> +	return &init_user_ns;
> +}
> +
> +static void user_put_ns(void *p)
> +{
> +	struct user_namespace *ns = p;
> +	put_user_ns(ns);
> +}
> +
> +static bool user_current_may_mount(void)
> +{
> +	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
> +}
> +
> +const struct kobj_ns_type_operations user_ns_type_operations = {
> +	.type			= KOBJ_NS_TYPE_USER,
> +	.current_may_mount	= user_current_may_mount,
> +	.grab_current_ns	= user_grab_current_ns,
> +	.initial_ns		= user_initial_ns,
> +	.drop_ns		= user_put_ns,
> +};
> +
> +static const void *block_class_user_namespace(struct device *dev)
> +{
> +	struct gendisk *disk;
> +
> +	if (dev->type == &part_type)
> +		disk = part_to_disk(dev_to_part(dev));
> +	else
> +		disk = dev_to_disk(dev);
> +
> +	return disk->user_ns;
> +}
> +
> +static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
> +{
> +	struct gendisk *disk;
> +	struct user_namespace *ns;
> +
> +	if (dev->type == &part_type)
> +		disk = part_to_disk(dev_to_part(dev));
> +	else
> +		disk = dev_to_disk(dev);
> +
> +	ns = disk->user_ns;
> +	if (ns && ns != &init_user_ns) {
> +		kuid_t ns_root_uid = make_kuid(ns, 0);
> +		kgid_t ns_root_gid = make_kgid(ns, 0);
> +
> +		if (uid_valid(ns_root_uid))
> +			*uid = ns_root_uid;
> +
> +		if (gid_valid(ns_root_gid))
> +			*gid = ns_root_gid;
> +	}

You're not setting uid and gid in the else case?
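
Whether leaving the outputs untouched is safe depends on the caller. Below is a minimal userspace sketch, assuming the caller seeds both outputs with global-root defaults before invoking the callback (I believe kobject_get_ownership() does this, but that is an assumption to check against the tree being patched). All names here are hypothetical stand-ins, not kernel types.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_ID ((unsigned int)-1)	/* stand-in for an unmapped id */
#define GLOBAL_ROOT 0u			/* stand-in for global root */

/* Hypothetical namespace: where uid 0 and gid 0 map to on the host. */
struct fake_ns {
	unsigned int root_uid;
	unsigned int root_gid;
};

struct fake_ns fake_init_ns = { GLOBAL_ROOT, GLOBAL_ROOT };

/* Callback: only overrides the outputs for a non-initial namespace. */
void fake_get_ownership(const struct fake_ns *ns,
			unsigned int *uid, unsigned int *gid)
{
	if (ns && ns != &fake_init_ns) {
		if (ns->root_uid != INVALID_ID)
			*uid = ns->root_uid;
		if (ns->root_gid != INVALID_ID)
			*gid = ns->root_gid;
	}
	/* else: outputs keep whatever the caller put there */
}

/* Assumed caller: seeds the defaults first, then lets the callback run. */
void query_ownership(const struct fake_ns *ns,
		     unsigned int *uid, unsigned int *gid)
{
	*uid = GLOBAL_ROOT;
	*gid = GLOBAL_ROOT;
	fake_get_ownership(ns, uid, gid);
}
```

If the caller did not seed the defaults, the else case would leave `*uid` and `*gid` uninitialized, which is the concern raised above.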

> +}
> +#endif /* CONFIG_BLK_DEV_LOOPFS */
> +
>  static int __init genhd_device_init(void)
>  {
>  	int error;
>  
>  	block_class.dev_kobj = sysfs_dev_block_kobj;
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	kobj_ns_type_register(&user_ns_type_operations);
> +#endif
>  	error = class_register(&block_class);
>  	if (unlikely(error))
>  		return error;
> @@ -1524,8 +1594,14 @@ static void disk_release(struct device *dev)
>  		blk_put_queue(disk->queue);
>  	kfree(disk);
>  }
> +
>  struct class block_class = {
>  	.name		= "block",
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	.ns_type	= &user_ns_type_operations,
> +	.namespace	= block_class_user_namespace,
> +	.get_ownership	= block_class_get_ownership,
> +#endif
>  };
>  
>  static char *block_devnode(struct device *dev, umode_t *mode,
> @@ -1715,6 +1791,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
>  		disk_to_dev(disk)->class = &block_class;
>  		disk_to_dev(disk)->type = &disk_type;
>  		device_initialize(disk_to_dev(disk));
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +		disk->user_ns = &init_user_ns;
> +#endif
>  	}
>  	return disk;
>  }
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 1f2d894ae454..02796ba6521a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
>  		goto out_bad;
>  
>  	/* The kernfs node has been moved to a different namespace */
> -	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
> -	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
> -		goto out_bad;
> +	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
> +		if (kernfs_init_ns_propagates(kn->parent) &&
> +		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
> +			goto out_good;
> +		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
> +			goto out_bad;
> +	}
>  
> +out_good:
>  	mutex_unlock(&kernfs_mutex);
>  	return 1;
>  out_bad:
> @@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
>  		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
>  
>  	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> +	if (!kn && kernfs_init_ns_propagates(parent)) {
> +		ns = kernfs_init_ns(parent->ns_type);
> +		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> +	}
>  
>  	/* no such entry */
>  	if (!kn || !kernfs_active(kn)) {
> @@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
>  	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>  {
> +	const void *init_ns;
> +
>  	if (pos) {
>  		int valid = kernfs_active(pos) &&
>  			pos->parent == parent && hash == pos->hash;
> @@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
>  		if (!valid)
>  			pos = NULL;
>  	}
> +
> +	if (kernfs_init_ns_propagates(parent))
> +		init_ns = kernfs_init_ns(parent->ns_type);
> +	else
> +		init_ns = NULL;
> +
>  	if (!pos && (hash > 1) && (hash < INT_MAX)) {
>  		struct rb_node *node = parent->dir.children.rb_node;
>  		while (node) {
> @@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
>  		}
>  	}
>  	/* Skip over entries which are dying/dead or in the wrong namespace */
> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> +	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
>  		struct rb_node *node = rb_next(&pos->rb);
>  		if (!node)
>  			pos = NULL;
> @@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>  {
>  	pos = kernfs_dir_pos(ns, parent, ino, pos);
>  	if (pos) {
> +		const void *init_ns;
> +		if (kernfs_init_ns_propagates(parent))
> +			init_ns = kernfs_init_ns(parent->ns_type);
> +		else
> +			init_ns = NULL;
> +
>  		do {
>  			struct rb_node *node = rb_next(&pos->rb);
>  			if (!node)
>  				pos = NULL;
>  			else
>  				pos = rb_to_kn(node);
> -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
> +		} while (pos && (!kernfs_active(pos) ||
> +				 (pos->ns != ns && pos->ns != init_ns)));
>  	}
>  	return pos;
>  }
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 7c972c00f84a..74eb6c447361 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -80,6 +80,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
>  	return d_inode(dentry)->i_private;
>  }
>  
> +#ifdef CONFIG_NET
> +extern struct net init_net;
> +#endif
> +
> +extern struct user_namespace init_user_ns;
> +
> +static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
> +{
> +	switch (ns_type) {
> +	case KOBJ_NS_TYPE_NET:
> +#ifdef CONFIG_NET
> +		return &init_net;
> +#else
> +		break;
> +#endif
> +	case KOBJ_NS_TYPE_USER:
> +		return &init_user_ns;
> +	default:
> +		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
> +	}
> +
> +	return NULL;
> +}
> +
>  extern const struct super_operations kernfs_sops;
>  extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
>  
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 5e2ec88a709e..99b82a0ae7ea 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
>  
>  	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
>  		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
> +	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
> +		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
>  	kernfs_free_fs_context(fc);
>  	kfree(kfc);
>  }
> @@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
>  		return -ENOMEM;
>  
>  	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> +	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);

It's nice and tidy this way so maybe worth it, but getting
the kobj_ns_type_lock spinlock twice in a row here seems
unfortunate.

>  	kfc->root = sysfs_root;
>  	kfc->magic = SYSFS_MAGIC;
>  	fc->fs_private = kfc;
> @@ -85,6 +88,7 @@ static void sysfs_kill_sb(struct super_block *sb)
>  
>  	kernfs_kill_sb(sb);
>  	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
> +	kobj_ns_drop(KOBJ_NS_TYPE_USER, ns[KOBJ_NS_TYPE_USER]);
>  }
>  
>  static struct file_system_type sysfs_fs_type = {
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 9b3fffdf4011..4edf8a82e815 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -220,6 +220,9 @@ struct gendisk {
>  	int node_id;
>  	struct badblocks *bb;
>  	struct lockdep_map lockdep_map;
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	struct user_namespace *user_ns;
> +#endif
>  };
>  
>  static inline struct gendisk *part_to_disk(struct hd_struct *part)
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index d0544f2e0c99..7809584a7c8e 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -53,6 +53,7 @@ enum kernfs_node_flag {
>  	KERNFS_SUICIDED		= 0x0800,
>  	KERNFS_EMPTY_DIR	= 0x1000,
>  	KERNFS_HAS_RELEASE	= 0x2000,
> +	KERNFS_NS_PROPAGATE	= 0x4000,
>  };
>  
>  /* @flags for kernfs_create_root() */
> @@ -337,6 +338,27 @@ static inline void kernfs_enable_ns(struct kernfs_node *kn,
>  	kn->ns_type = ns_type;
>  }
>  
> +static inline void kernfs_enable_init_ns_propagates(struct kernfs_node *kn)
> +{
> +	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
> +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
> +	WARN_ON_ONCE(!(kn->flags & KERNFS_NS));
> +	kn->flags |= KERNFS_NS_PROPAGATE;
> +}
> +
> +/**
> + * kernfs_init_ns_propagates - test whether init ns propagates
> + * @kn: the node to test
> + *
> + * Test whether kernfs entries created in the init namespace propagate into
> + * other namespaces.
> + */
> +static inline bool kernfs_init_ns_propagates(const struct kernfs_node *kn)
> +{
> +	return ((kn->flags & (KERNFS_NS | KERNFS_NS_PROPAGATE)) ==
> +		(KERNFS_NS | KERNFS_NS_PROPAGATE));
> +}
> +
>  /**
>   * kernfs_ns_enabled - test whether namespace is enabled
>   * @kn: the node to test
> diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
> index 991a9286bcea..a9c45bcce235 100644
> --- a/include/linux/kobject_ns.h
> +++ b/include/linux/kobject_ns.h
> @@ -26,6 +26,7 @@ struct kobject;
>  enum kobj_ns_type {
>  	KOBJ_NS_TYPE_NONE = 0,
>  	KOBJ_NS_TYPE_NET,
> +	KOBJ_NS_TYPE_USER,
>  	KOBJ_NS_TYPES
>  };
>  
> @@ -34,6 +35,8 @@ enum kobj_ns_type {
>   *   @grab_current_ns: return a new reference to calling task's namespace
>   *   @initial_ns: return the initial namespace (i.e. init_net_ns)
>   *   @drop_ns: drops a reference to namespace
> + *   @initial_ns_propagates: whether devices in the initial namespace propagate
> + *			to all other namespaces
>   */
>  struct kobj_ns_type_operations {
>  	enum kobj_ns_type type;
> @@ -41,6 +44,7 @@ struct kobj_ns_type_operations {
>  	void *(*grab_current_ns)(void);
>  	const void *(*initial_ns)(void);
>  	void (*drop_ns)(void *);
> +	bool (*initial_ns_propagates)(void);
>  };
>  
>  int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
> diff --git a/lib/kobject.c b/lib/kobject.c
> index c58c62d49a10..96bb8c732d1c 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -121,6 +121,8 @@ static int create_dir(struct kobject *kobj)
>  		BUG_ON(!kobj_ns_type_registered(ops->type));
>  
>  		sysfs_enable_ns(kobj->sd, ops->type);
> +		if (ops->initial_ns_propagates && ops->initial_ns_propagates())
> +			kernfs_enable_init_ns_propagates(kobj->sd);
>  	}
>  
>  	return 0;
> -- 
> 2.26.1
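
As a reading aid for the kernfs_dir_pos() and kernfs_dir_next_pos() hunks
quoted above: the rewritten loop conditions amount to a per-entry
visibility test. Below is a minimal userspace sketch of that test. It
assumes a simplified node type; struct sketch_node and
sketch_entry_visible() are made-up names for illustration, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct kernfs_node. */
struct sketch_node {
	bool active;		/* false models a dying/dead entry */
	const void *ns;		/* namespace tag of the entry */
};

/*
 * Mirrors the negation of the skip condition in the quoted hunks:
 *   skip    = !active || (pos->ns != ns && pos->ns != init_ns)
 *   visible =  active && (pos->ns == ns || pos->ns == init_ns)
 *
 * init_ns is the initial namespace tag when the parent directory
 * propagates initial-namespace entries, and NULL otherwise.
 */
bool sketch_entry_visible(const struct sketch_node *pos, const void *ns,
			  const void *init_ns)
{
	assert(pos);

	if (!pos->active)
		return false;

	return pos->ns == ns || pos->ns == init_ns;
}
```

With propagation enabled, an entry tagged with the initial namespace is
listed in every mount; with init_ns == NULL only entries whose tag matches
the mount's own tag are listed.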


* Re: [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add()
  2020-04-22 14:54 ` [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
@ 2020-04-23  1:36   ` Serge E. Hallyn
  0 siblings, 0 replies; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-23  1:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:36PM +0200, Christian Brauner wrote:
> Tag loop devices with the namespace the loopfs instance was mounted in.
> This has the consequence that loopfs devices carry the correct sysfs
> permissions for all their core files. All other device files will
> continue to be correctly owned by the initial namespaces. Here is sample
> output:
> 
> root@b1:~# mount -t loop loop /mnt
> root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
> root@b1:~# losetup -f
> /dev/loop8
> root@b1:~# ln -sf /mnt/loop8 /dev/loop8
> root@b1:~# ls -al /sys/class/block/loop8
> lrwxrwxrwx 1 root root 0 Apr  7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
> root@b1:~# ls -al /sys/class/block/loop8/
> total 0
> drwxr-xr-x  9 root   root       0 Apr  7 13:06 .
> drwxr-xr-x 18 nobody nogroup    0 Apr  7 13:07 ..
> -r--r--r--  1 root   root    4096 Apr  7 13:06 alignment_offset
> lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 bdi -> ../../bdi/7:8
> -r--r--r--  1 root   root    4096 Apr  7 13:06 capability
> -r--r--r--  1 root   root    4096 Apr  7 13:06 dev
> -r--r--r--  1 root   root    4096 Apr  7 13:06 discard_alignment
> -r--r--r--  1 root   root    4096 Apr  7 13:06 events
> -r--r--r--  1 root   root    4096 Apr  7 13:06 events_async
> -rw-r--r--  1 root   root    4096 Apr  7 13:06 events_poll_msecs
> -r--r--r--  1 root   root    4096 Apr  7 13:06 ext_range
> -r--r--r--  1 root   root    4096 Apr  7 13:06 hidden
> drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 holders
> -r--r--r--  1 root   root    4096 Apr  7 13:06 inflight
> drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 integrity
> drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 mq
> drwxr-xr-x  2 root   root       0 Apr  7 13:06 power
> drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 queue
> -r--r--r--  1 root   root    4096 Apr  7 13:06 range
> -r--r--r--  1 root   root    4096 Apr  7 13:06 removable
> -r--r--r--  1 root   root    4096 Apr  7 13:06 ro
> -r--r--r--  1 root   root    4096 Apr  7 13:06 size
> drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 slaves
> -r--r--r--  1 root   root    4096 Apr  7 13:06 stat
> lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 subsystem -> ../../../../class/block
> drwxr-xr-x  2 root   root       0 Apr  7 13:06 trace
> -rw-r--r--  1 root   root    4096 Apr  7 13:06 uevent
> root@b1:~#
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

I was a *bit* worried about not taking a reference to the
user namespace, but it doesn't look like the chain of
loop_remove() -> del_gendisk() -> device_del() will allow any later
access through sysfs, so I guess it's fine.

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> unchanged
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Adapted commit message otherwise unchanged.
> ---
>  drivers/block/loop.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 8e21d4b33e01..2dc53bad4b48 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2212,6 +2212,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
>  	disk->private_data	= lo;
>  	disk->queue		= lo->lo_queue;
>  	sprintf(disk->disk_name, "loop%d", i);
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> +	if (loopfs_i_sb(inode))
> +		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
> +#endif
>  
>  	add_disk(disk);
>  
> -- 
> 2.26.1


* Re: [PATCH v2 7/7] loopfs: only show devices in their correct instance
  2020-04-22 14:54 ` [PATCH v2 7/7] loopfs: only show devices in their correct instance Christian Brauner
@ 2020-04-23  1:37   ` Serge E. Hallyn
  0 siblings, 0 replies; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-23  1:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:54:37PM +0200, Christian Brauner wrote:
> Since loopfs devices belong to a loopfs instance they have no business
> polluting the host's devtmpfs mount and should not propagate out of the
> namespace they belong to.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> unchanged
> ---
>  drivers/base/devtmpfs.c | 4 ++--
>  drivers/block/loop.c    | 4 +++-
>  include/linux/device.h  | 3 +++
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> index c9017e0584c0..77371ceb88fa 100644
> --- a/drivers/base/devtmpfs.c
> +++ b/drivers/base/devtmpfs.c
> @@ -111,7 +111,7 @@ int devtmpfs_create_node(struct device *dev)
>  	const char *tmp = NULL;
>  	struct req req;
>  
> -	if (!thread)
> +	if (!thread || dev->no_devnode)
>  		return 0;
>  
>  	req.mode = 0;
> @@ -138,7 +138,7 @@ int devtmpfs_delete_node(struct device *dev)
>  	const char *tmp = NULL;
>  	struct req req;
>  
> -	if (!thread)
> +	if (!thread || dev->no_devnode)
>  		return 0;
>  
>  	req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 2dc53bad4b48..5548151b9f11 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2213,8 +2213,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
>  	disk->queue		= lo->lo_queue;
>  	sprintf(disk->disk_name, "loop%d", i);
>  #ifdef CONFIG_BLK_DEV_LOOPFS
> -	if (loopfs_i_sb(inode))
> +	if (loopfs_i_sb(inode)) {
>  		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
> +		disk_to_dev(disk)->no_devnode = true;
> +	}
>  #endif
>  
>  	add_disk(disk);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index ac8e37cd716a..c69ef1c5a0ef 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -523,6 +523,8 @@ struct dev_links_info {
>   *		  sync_state() callback.
>   * @dma_coherent: this particular device is dma coherent, even if the
>   *		architecture supports non-coherent devices.
> + * @no_devnode: whether device nodes associated with this device are kept out
> + *		of devtmpfs (e.g. due to separate filesystem)
>   *
>   * At the lowest level, every device in a Linux system is represented by an
>   * instance of struct device. The device structure contains the information
> @@ -622,6 +624,7 @@ struct device {
>      defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
>  	bool			dma_coherent:1;
>  #endif
> +	bool			no_devnode:1;
>  };
>  
>  static inline struct device *kobj_to_dev(struct kobject *kobj)
> -- 
> 2.26.1


* Re: [PATCH v2 5/7] loop: preserve sysfs backwards compatibility
  2020-04-23  1:17   ` Serge E. Hallyn
@ 2020-04-23 11:15     ` Christian Brauner
  2020-04-23 16:13       ` Serge E. Hallyn
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-23 11:15 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev,
	Steve Barber, Dylan Reid, Filipe Brandenburger, Kees Cook,
	Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 08:17:06PM -0500, Serge Hallyn wrote:
> On Wed, Apr 22, 2020 at 04:54:35PM +0200, Christian Brauner wrote:
> > For sysfs the initial namespace is special. All devices currently
> > propagate into all non-initial namespaces. For example, sysfs is usually
> > mounted in a privileged or unprivileged container and all devices are
> > visible to the container but are owned by global root. Even though none
> > of the propagated files can be used there are still a lot of read-only
> > values that are accessed or read by tools running in non-initial
> > namespaces. Some devices though, which can be moved or created in
> > another namespace, will only show up in the corresponding namespace.
> > This currently includes network and loop devices but no other ones.
> > Since all current workloads depend on devices from the initial namespace
> > being visible this behavior cannot be simply changed. This patch just
> > makes sure to keep propagating devices that share the same device class
> > with loop devices from the initial namespace into all non-initial
> > namespaces as before. In short, nothing changes; only loopfs loop devices
> > will be shown in their correct namespace.
> > 
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Hi,
> 
> two comments below:
> 
> > ---
> > /* v2 */
> > - Christian Brauner <christian.brauner@ubuntu.com>:
> >   - Protect init_net with a CONFIG_NET ifdef in case it is set to "n".
> >   - As Tejun pointed out there is an argument to be made that a new mount
> >     option for sysfs could be added that would change how devices are
> >     propagated. This patch does not prevent this but it is an orthogonal
> >     problem.
> > ---
> >  block/genhd.c               | 79 +++++++++++++++++++++++++++++++++++++
> >  fs/kernfs/dir.c             | 34 +++++++++++++---
> >  fs/kernfs/kernfs-internal.h | 24 +++++++++++
> >  fs/sysfs/mount.c            |  4 ++
> >  include/linux/genhd.h       |  3 ++
> >  include/linux/kernfs.h      | 22 +++++++++++
> >  include/linux/kobject_ns.h  |  4 ++
> >  lib/kobject.c               |  2 +
> >  8 files changed, 167 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 06b642b23a07..b5b2601c4311 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -1198,11 +1198,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
> >  	return NULL;
> >  }
> >  
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +static void *user_grab_current_ns(void)
> > +{
> > +	struct user_namespace *ns = current_user_ns();
> > +	return get_user_ns(ns);
> > +}
> > +
> > +static const void *user_initial_ns(void)
> > +{
> > +	return &init_user_ns;
> > +}
> > +
> > +static void user_put_ns(void *p)
> > +{
> > +	struct user_namespace *ns = p;
> > +	put_user_ns(ns);
> > +}
> > +
> > +static bool user_current_may_mount(void)
> > +{
> > +	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
> > +}
> > +
> > +const struct kobj_ns_type_operations user_ns_type_operations = {
> > +	.type			= KOBJ_NS_TYPE_USER,
> > +	.current_may_mount	= user_current_may_mount,
> > +	.grab_current_ns	= user_grab_current_ns,
> > +	.initial_ns		= user_initial_ns,
> > +	.drop_ns		= user_put_ns,
> > +};
> > +
> > +static const void *block_class_user_namespace(struct device *dev)
> > +{
> > +	struct gendisk *disk;
> > +
> > +	if (dev->type == &part_type)
> > +		disk = part_to_disk(dev_to_part(dev));
> > +	else
> > +		disk = dev_to_disk(dev);
> > +
> > +	return disk->user_ns;
> > +}
> > +
> > +static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
> > +{
> > +	struct gendisk *disk;
> > +	struct user_namespace *ns;
> > +
> > +	if (dev->type == &part_type)
> > +		disk = part_to_disk(dev_to_part(dev));
> > +	else
> > +		disk = dev_to_disk(dev);
> > +
> > +	ns = disk->user_ns;
> > +	if (ns && ns != &init_user_ns) {
> > +		kuid_t ns_root_uid = make_kuid(ns, 0);
> > +		kgid_t ns_root_gid = make_kgid(ns, 0);
> > +
> > +		if (uid_valid(ns_root_uid))
> > +			*uid = ns_root_uid;
> > +
> > +		if (gid_valid(ns_root_gid))
> > +			*gid = ns_root_gid;
> > +	}
> 
> You're not setting uid and gid in the else case?

Right, the reason being that sysfs and the associated kobject
infrastructure will always set global root as the default. So the
callchain is:
kobject_get_ownership()
and this calls the ktype callback, which hits
-> device_get_ownership()
which calls into the device-class-specific callback, which in this
case is block_class_get_ownership().

And there are no direct callers of, say, <device-class>->get_ownership();
everything always goes through the callback infrastructure.
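
To make the "default first, override second" pattern concrete, here is a
minimal userspace sketch. It is not the kernel code: all sketch_* names are
made up for illustration, plain unsigned ints stand in for kuid_t/kgid_t,
and a namespace is reduced to the host ids that back its root user.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kuid_t/kgid_t; 0 plays the role of global root. */
typedef unsigned int sketch_uid_t;
typedef unsigned int sketch_gid_t;

#define SKETCH_INVALID_ID ((unsigned int)-1)

/* Hypothetical namespace: the host ids that its root (id 0) maps to. */
struct sketch_user_ns {
	sketch_uid_t root_uid;	/* SKETCH_INVALID_ID if unmapped */
	sketch_gid_t root_gid;	/* SKETCH_INVALID_ID if unmapped */
};

const struct sketch_user_ns sketch_init_ns = { 0, 0 };

struct sketch_disk {
	const struct sketch_user_ns *user_ns;
};

/* Class callback: only overrides the defaults, never sets them itself. */
void sketch_class_get_ownership(const struct sketch_disk *disk,
				sketch_uid_t *uid, sketch_gid_t *gid)
{
	const struct sketch_user_ns *ns = disk->user_ns;

	if (ns && ns != &sketch_init_ns) {
		if (ns->root_uid != SKETCH_INVALID_ID)
			*uid = ns->root_uid;
		if (ns->root_gid != SKETCH_INVALID_ID)
			*gid = ns->root_gid;
	}
}

/* Entry point: sets global root first, then lets the class override. */
void sketch_get_ownership(const struct sketch_disk *disk,
			  sketch_uid_t *uid, sketch_gid_t *gid)
{
	assert(disk && uid && gid);

	*uid = 0;
	*gid = 0;
	sketch_class_get_ownership(disk, uid, gid);
}
```

Because the only way in is sketch_get_ownership(), the class callback can
leave uid and gid untouched for initial-namespace disks and the caller
still sees global root.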

> 
> > +}
> > +#endif /* CONFIG_BLK_DEV_LOOPFS */
> > +
> >  static int __init genhd_device_init(void)
> >  {
> >  	int error;
> >  
> >  	block_class.dev_kobj = sysfs_dev_block_kobj;
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	kobj_ns_type_register(&user_ns_type_operations);
> > +#endif
> >  	error = class_register(&block_class);
> >  	if (unlikely(error))
> >  		return error;
> > @@ -1524,8 +1594,14 @@ static void disk_release(struct device *dev)
> >  		blk_put_queue(disk->queue);
> >  	kfree(disk);
> >  }
> > +
> >  struct class block_class = {
> >  	.name		= "block",
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	.ns_type	= &user_ns_type_operations,
> > +	.namespace	= block_class_user_namespace,
> > +	.get_ownership	= block_class_get_ownership,
> > +#endif
> >  };
> >  
> >  static char *block_devnode(struct device *dev, umode_t *mode,
> > @@ -1715,6 +1791,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
> >  		disk_to_dev(disk)->class = &block_class;
> >  		disk_to_dev(disk)->type = &disk_type;
> >  		device_initialize(disk_to_dev(disk));
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +		disk->user_ns = &init_user_ns;
> > +#endif
> >  	}
> >  	return disk;
> >  }
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 1f2d894ae454..02796ba6521a 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> >  		goto out_bad;
> >  
> >  	/* The kernfs node has been moved to a different namespace */
> > -	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
> > -	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
> > -		goto out_bad;
> > +	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
> > +		if (kernfs_init_ns_propagates(kn->parent) &&
> > +		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
> > +			goto out_good;
> > +		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
> > +			goto out_bad;
> > +	}
> >  
> > +out_good:
> >  	mutex_unlock(&kernfs_mutex);
> >  	return 1;
> >  out_bad:
> > @@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
> >  		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
> >  
> >  	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > +	if (!kn && kernfs_init_ns_propagates(parent)) {
> > +		ns = kernfs_init_ns(parent->ns_type);
> > +		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > +	}
> >  
> >  	/* no such entry */
> >  	if (!kn || !kernfs_active(kn)) {
> > @@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
> >  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >  	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >  {
> > +	const void *init_ns;
> > +
> >  	if (pos) {
> >  		int valid = kernfs_active(pos) &&
> >  			pos->parent == parent && hash == pos->hash;
> > @@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >  		if (!valid)
> >  			pos = NULL;
> >  	}
> > +
> > +	if (kernfs_init_ns_propagates(parent))
> > +		init_ns = kernfs_init_ns(parent->ns_type);
> > +	else
> > +		init_ns = NULL;
> > +
> >  	if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >  		struct rb_node *node = parent->dir.children.rb_node;
> >  		while (node) {
> > @@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >  		}
> >  	}
> >  	/* Skip over entries which are dying/dead or in the wrong namespace */
> > -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> > +	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
> >  		struct rb_node *node = rb_next(&pos->rb);
> >  		if (!node)
> >  			pos = NULL;
> > @@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> >  {
> >  	pos = kernfs_dir_pos(ns, parent, ino, pos);
> >  	if (pos) {
> > +		const void *init_ns;
> > +		if (kernfs_init_ns_propagates(parent))
> > +			init_ns = kernfs_init_ns(parent->ns_type);
> > +		else
> > +			init_ns = NULL;
> > +
> >  		do {
> >  			struct rb_node *node = rb_next(&pos->rb);
> >  			if (!node)
> >  				pos = NULL;
> >  			else
> >  				pos = rb_to_kn(node);
> > -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
> > +		} while (pos && (!kernfs_active(pos) ||
> > +				 (pos->ns != ns && pos->ns != init_ns)));
> >  	}
> >  	return pos;
> >  }
> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> > index 7c972c00f84a..74eb6c447361 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > @@ -80,6 +80,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
> >  	return d_inode(dentry)->i_private;
> >  }
> >  
> > +#ifdef CONFIG_NET
> > +extern struct net init_net;
> > +#endif
> > +
> > +extern struct user_namespace init_user_ns;
> > +
> > +static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
> > +{
> > +	switch (ns_type) {
> > +	case KOBJ_NS_TYPE_NET:
> > +#ifdef CONFIG_NET
> > +		return &init_net;
> > +#else
> > +		break;
> > +#endif
> > +	case KOBJ_NS_TYPE_USER:
> > +		return &init_user_ns;
> > +	default:
> > +		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> >  extern const struct super_operations kernfs_sops;
> >  extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
> >  
> > diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> > index 5e2ec88a709e..99b82a0ae7ea 100644
> > --- a/fs/sysfs/mount.c
> > +++ b/fs/sysfs/mount.c
> > @@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
> >  
> >  	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
> >  		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
> > +	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
> > +		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
> >  	kernfs_free_fs_context(fc);
> >  	kfree(kfc);
> >  }
> > @@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
> >  		return -ENOMEM;
> >  
> >  	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> > +	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
> 
> It's nice and tidy this way so maybe worth it, but getting
> the kobj_ns_type_lock spinlock twice in a row here seems
> unfortunate.

Let me see if I can do something non-ugly and moderately simple about
this. If not, it's probably fine as it is since it only happens on sysfs
mount.


* Re: [PATCH v2 2/7] loopfs: implement loopfs
  2020-04-22 21:52   ` Serge E. Hallyn
@ 2020-04-23 11:24     ` Christian Brauner
  2020-04-23 16:17       ` Serge E. Hallyn
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2020-04-23 11:24 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev,
	Steve Barber, Dylan Reid, Filipe Brandenburger, Kees Cook,
	Benjamin Elder, Akihiro Suda

On Wed, Apr 22, 2020 at 04:52:13PM -0500, Serge Hallyn wrote:
> On Wed, Apr 22, 2020 at 04:54:32PM +0200, Christian Brauner wrote:
> > This implements loopfs, a loop device filesystem. It takes inspiration
> > from the binderfs filesystem I implemented about two years ago and with
> > which we had overall good experiences so far. Parts of it are also
> > based on [3] but it's mostly a new, imho cleaner approach.
> > 
> > Loopfs allows creating private loop device instances for applications
> > for various use-cases. It covers the use-case that was expressed on-list
> > and in-person to get programmatic access to private loop devices for
> > image building in sandboxes. An illustration for this is provided in
> > [4].
> > 
> > Also loopfs is intended to provide loop devices to privileged and
> > unprivileged containers which has been a frequent request from various
> > major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
> > providing a non-exhaustive list of issues and requests (cf. [5]) around
> > this feature mainly to illustrate that I'm not making the use-cases up.
> > Currently none of this can be done safely since handing a loop device
> > from the host into a container means that the container can see anything
> > that the host is doing with that loop device and what other containers
> > are doing with that device too. And (bind-)mounting devtmpfs inside of
> > containers is not secure at all so also not an option (though sometimes
> > done out of despair apparently).
> > 
> > The workloads people run in containers are supposed to be indiscernible
> > from workloads run on the host and the tools inside of the container are
> > supposed to not be required to be aware that they are running inside a
> > container apart from containerization tools themselves. This is
> > especially true when running older distros in containers that did exist
> > before containers were as ubiquitous as they are today. With loopfs a user
> > can call mount -o loop and in a correctly set up container things work
> > the same way they would on the host. The filesystem representation
> > allows us to do this in a very simple way. At container setup, a
> > container manager can mount a private instance of loopfs somewhere, e.g.
> > at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
> > to /dev/loop-control, pre-allocate and symlink the number of standard
> > devices into their standard location and have a service file or rules in
> > place that symlink additionally allocated loop devices through losetup
> > into place as well.
> > With the new syscall interception logic this is also possible for
> > unprivileged containers. In these cases when a user calls mount -o loop
> > <image> <mountpoint> it will be possible to completely set up the loop
> > device in the container. The final mount syscall is handled through
> > syscall interception which we already implemented and released in
> > earlier kernels (see [1] and [2]) and is actively used in production
> > workloads. The mount is often rewritten to a fuse binary to provide safe
> > access for unprivileged containers.
> > 
> > Loopfs also allows the creation of hidden/detached dynamic loop devices
> > and associated mounts, which was also a frequent request. With the
> > old mount api this can be achieved by creating a temporary loopfs and
> > stashing a file descriptor to the mount point and the loop-control
> > device and immediately unmounting the loopfs instance.  With the new
> > mount api a detached mount can be created directly (i.e. a mount not
> > visible anywhere in the filesystem). New loop devices can then be
> > allocated and configured. They can be mounted through
> > /proc/self/fd/<nr> with the old mount api or by using the fd directly
> > with the new mount api. Combined with a mount namespace this allows for
> > fully auto-cleaned up loop devices on program crash. This ties back to
> > various use-cases and is illustrated in [4].
> > 
> > The filesystem representation requires the standard boilerplate
> > filesystem code we know from other tiny filesystems. And all of
> > the loopfs code is hidden under a config option that defaults to false.
> > This specifically means that none of the code even exists when users do
> > not have any use-case for loopfs.
> > In addition, the loopfs code does not alter how loop devices behave at
> > all, i.e. there are no changes to any existing workloads and I've taken
> > care to ifdef all loopfs specific things out.
> > 
> > Each loopfs mount is a separate instance. As such loop devices created
> > in one instance are independent of loop devices created in another
> > instance. This specifically entails that loop devices are only visible
> > in the loopfs instance they belong to.
> > 
> > The number of loop devices available in loopfs instances is
> > hierarchically limited through /proc/sys/user/max_loop_devices via the
> > ucount infrastructure (Thanks to David Rheinsberg for pointing out that
> > missing piece.). An administrator could e.g. set
> > echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
> > instance mounted by uid x can only create 3 loop devices no matter how
> > many loopfs instances they mount. This limit applies hierarchically to
> > all user namespaces.
> 
> Hm, info->device_count is per loopfs mount, though, right?  I don't
> see where this gets incremented for all of a user's loopfs mounts
> when one adds a loopdev?
> 
> I'm sure I'm missing something obvious...

Hm, I think you might be mixing up the two limits? device_count
corresponds to the "max" mount option and is not involved in enforcing
hierarchical limits. The global restriction is enforced through
inc_ucount() which tracks by the uid of the mounter of the superblock.
If the same user mounts multiple loopfs instances in the same namespace
the ucount infra will enforce his quota across all loopfs instances.
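
A minimal userspace sketch of how the two limits interact, under
simplifying assumptions: the sketch_* types are made up for illustration,
a single counter per user stands in for the ucount infrastructure, the
user namespace hierarchy is left out, and treating max == 0 as "no
per-instance cap" is my assumption rather than something taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user counter, standing in for the ucount infrastructure. */
struct sketch_user {
	int loop_devices;	/* devices charged to this user so far */
	int max_loop_devices;	/* stand-in for /proc/sys/user/max_loop_devices */
};

/* Hypothetical per-mount state, standing in for the loopfs superblock info. */
struct sketch_loopfs {
	struct sketch_user *mounter;	/* user who mounted this instance */
	int device_count;		/* devices in this instance only */
	int max;			/* "max" mount option; 0 = no cap (assumed) */
};

/* Returns true if a new loop device may be added to this instance. */
bool sketch_loop_add(struct sketch_loopfs *fs)
{
	assert(fs && fs->mounter);

	/* Per-instance limit from the "max" mount option. */
	if (fs->max > 0 && fs->device_count >= fs->max)
		return false;

	/* Per-user limit, shared by every instance of the same mounter. */
	if (fs->mounter->loop_devices >= fs->mounter->max_loop_devices)
		return false;

	fs->mounter->loop_devices++;
	fs->device_count++;
	return true;
}
```

With a per-user limit of 3, a user holding two instances can create three
devices in total across both, regardless of how device_count is split
between them.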

> 
> > In addition, loopfs has a "max" mount option which allows setting a limit
> > on the number of loop devices for a given loopfs instance. This is
> > mainly to cover use-cases where a single loopfs mount is shared as a
> > bind-mount between multiple parties that are prevented from creating
> > other loopfs mounts and is equivalent to the semantics of the binderfs
> > and devpts "max" mount option.
> > 
> > Note that in __loop_clr_fd() we now need not just check whether bdev is
> > valid but also whether bdev->bd_disk is valid. This wasn't necessary
> > before because in order to call LOOP_CLR_FD the loop device would need
> > to be open and thus bdev->bd_disk was guaranteed to be allocated. For
> > loopfs loop devices we allow callers to simply unlink them just as we do
> > for binderfs binder devices and we do also need to account for the case
> > where a loopfs superblock is shut down while backing files might still be
> > associated with some loop devices. In such cases no bd_disk device will
> > be attached to bdev. This is not in itself noteworthy; it's more about
> > documenting the "why" of the added bdev->bd_disk check for posterity.
> > 
> > [1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> > [2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> > [3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
> > [4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
> > [5]: https://github.com/kubernetes-sigs/kind/issues/1333
> >      https://github.com/kubernetes-sigs/kind/issues/1248
> >      https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
> >      https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
> >      https://gitlab.com/gitlab-com/support-forum/issues/3732
> >      https://github.com/moby/moby/issues/27886
> >      https://twitter.com/_AkihiroSuda_/status/1249664478267854848
> >      https://serverfault.com/questions/701384/loop-device-in-a-linux-container
> >      https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
> >      https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Steve Barber <smbarber@google.com>
> > Cc: Filipe Brandenburger <filbranden@gmail.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Benjamin Elder <bentheelder@google.com>
> > Cc: Seth Forshee <seth.forshee@canonical.com>
> > Cc: Stéphane Graber <stgraber@ubuntu.com>
> > Cc: Tom Gundersen <teg@jklm.no>
> > Cc: Serge Hallyn <serge@hallyn.com>
> 
> Reviewed-by: Serge Hallyn <serge@hallyn.com>
> 
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Christian Kellner <ckellner@redhat.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Dylan Reid <dgreid@google.com>
> > Cc: David Rheinsberg <david.rheinsberg@gmail.com>
> > Cc: Akihiro Suda <suda.kyoto@gmail.com>
> > Cc: Dmitry Vyukov <dvyukov@google.com>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> > /* v2 */
> > - David Rheinsberg <david.rheinsberg@gmail.com> /
> >   Christian Brauner <christian.brauner@ubuntu.com>:
> >   - Correctly cleanup loop devices that are in-use after the loopfs
> >     instance has been shut down. This is important for some use-cases
> >     that David pointed out where they effectively create a loopfs
> >     instance, allocate devices and drop unnecessary references to it.
> > - Christian Brauner <christian.brauner@ubuntu.com>:
> >   - Replace lo_loopfs_i inode member in struct loop_device with a custom
> >     struct lo_info pointer which is only allocated for loopfs loop
> >     devices.
> > ---
> >  MAINTAINERS                    |   5 +
> >  drivers/block/Kconfig          |   4 +
> >  drivers/block/Makefile         |   1 +
> >  drivers/block/loop.c           | 200 ++++++++++---
> >  drivers/block/loop.h           |  12 +-
> >  drivers/block/loopfs/Makefile  |   3 +
> >  drivers/block/loopfs/loopfs.c  | 494 +++++++++++++++++++++++++++++++++
> >  drivers/block/loopfs/loopfs.h  |  36 +++
> >  include/linux/user_namespace.h |   3 +
> >  include/uapi/linux/magic.h     |   1 +
> >  kernel/ucount.c                |   3 +
> >  11 files changed, 721 insertions(+), 41 deletions(-)
> >  create mode 100644 drivers/block/loopfs/Makefile
> >  create mode 100644 drivers/block/loopfs/loopfs.c
> >  create mode 100644 drivers/block/loopfs/loopfs.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index b816a453b10e..560b37a65bce 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -9957,6 +9957,11 @@ W:	http://www.avagotech.com/support/
> >  F:	drivers/message/fusion/
> >  F:	drivers/scsi/mpt3sas/
> >  
> > +LOOPFS FILE SYSTEM
> > +M:	Christian Brauner <christian.brauner@ubuntu.com>
> > +S:	Supported
> > +F:	drivers/block/loopfs/
> > +
> >  LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
> >  M:	Matthew Wilcox <willy@infradead.org>
> >  L:	linux-scsi@vger.kernel.org
> > diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> > index 025b1b77b11a..d7ff37d795ad 100644
> > --- a/drivers/block/Kconfig
> > +++ b/drivers/block/Kconfig
> > @@ -214,6 +214,10 @@ config BLK_DEV_LOOP
> >  
> >  	  Most users will answer N here.
> >  
> > +config BLK_DEV_LOOPFS
> > +	bool "Loopback device virtual filesystem support"
> > +	depends on BLK_DEV_LOOP=y
> > +
> >  config BLK_DEV_LOOP_MIN_COUNT
> >  	int "Number of loop devices to pre-create at init time"
> >  	depends on BLK_DEV_LOOP
> > diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> > index 795facd8cf19..7052be26aa8b 100644
> > --- a/drivers/block/Makefile
> > +++ b/drivers/block/Makefile
> > @@ -36,6 +36,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
> >  obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
> >  obj-$(CONFIG_BLK_DEV_RBD)     += rbd.o
> >  obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
> > +obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs/
> >  
> >  obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
> >  obj-$(CONFIG_ZRAM) += zram/
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index da693e6a834e..52f7583dd17d 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -81,6 +81,10 @@
> >  
> >  #include "loop.h"
> >  
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +#include "loopfs/loopfs.h"
> > +#endif
> > +
> >  #include <linux/uaccess.h>
> >  
> >  static DEFINE_IDR(loop_index_idr);
> > @@ -1115,6 +1119,24 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
> >  	return err;
> >  }
> >  
> > +static void loop_remove(struct loop_device *lo)
> > +{
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	loopfs_remove(lo);
> > +#endif
> > +	del_gendisk(lo->lo_disk);
> > +	blk_cleanup_queue(lo->lo_queue);
> > +	blk_mq_free_tag_set(&lo->tag_set);
> > +	put_disk(lo->lo_disk);
> > +	kfree(lo);
> > +}
> > +
> > +static inline void __loop_remove(struct loop_device *lo)
> > +{
> > +	idr_remove(&loop_index_idr, lo->lo_number);
> > +	loop_remove(lo);
> > +}
> > +
> >  static int __loop_clr_fd(struct loop_device *lo, bool release)
> >  {
> >  	struct file *filp = NULL;
> > @@ -1164,7 +1186,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> >  	}
> >  	set_capacity(lo->lo_disk, 0);
> >  	loop_sysfs_exit(lo);
> > -	if (bdev) {
> > +	if (bdev && bdev->bd_disk) {
> >  		bd_set_size(bdev, 0);
> >  		/* let user-space know about this change */
> >  		kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
> > @@ -1174,7 +1196,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> >  	module_put(THIS_MODULE);
> >  	blk_mq_unfreeze_queue(lo->lo_queue);
> >  
> > -	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
> > +	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk;
> >  	lo_number = lo->lo_number;
> >  	loop_unprepare_queue(lo);
> >  out_unlock:
> > @@ -1213,7 +1235,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> >  	lo->lo_flags = 0;
> >  	if (!part_shift)
> >  		lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
> > -	lo->lo_state = Lo_unbound;
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	if (loopfs_wants_remove(lo))
> > +		__loop_remove(lo);
> > +	else
> > +#endif
> > +		lo->lo_state = Lo_unbound;
> >  	mutex_unlock(&loop_ctl_mutex);
> >  
> >  	/*
> > @@ -1259,6 +1286,74 @@ static int loop_clr_fd(struct loop_device *lo)
> >  	return __loop_clr_fd(lo, false);
> >  }
> >  
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +int loopfs_rundown_locked(struct loop_device *lo)
> > +{
> > +	int ret;
> > +
> > +	if (WARN_ON_ONCE(!loopfs_device(lo)))
> > +		return -EINVAL;
> > +
> > +	ret = mutex_lock_killable(&loop_ctl_mutex);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) {
> > +		ret = -EBUSY;
> > +	} else {
> > +		/*
> > +		 * Since the device is unbound it has no associated backing
> > +		 * file and we can safely set Lo_rundown to prevent it from
> > +		 * being found. Actual cleanup happens during inode eviction.
> > +		 */
> > +		lo->lo_state = Lo_rundown;
> > +		ret = 0;
> > +	}
> > +
> > +	mutex_unlock(&loop_ctl_mutex);
> > +	return ret;
> > +}
> > +
> > +/**
> > + * loopfs_evict_locked() - remove loop device or mark inactive
> > + * @lo:	loopfs loop device
> > + *
> > + * This function will remove a loop device. If it has no users
> > + * and is bound the backing file will be cleaned up. If the loop
> > + * device has users it will be marked for auto cleanup.
> > + * This function is only called when a loopfs instance is shutdown
> > + * when all references to it from this loopfs instance have been
> > + * dropped. If there are still any references to it cleanup will
> > + * happen in lo_release().
> > + */
> > +void loopfs_evict_locked(struct loop_device *lo)
> > +{
> > +	struct lo_loopfs *lo_info;
> > +	struct inode *lo_inode;
> > +
> > +	WARN_ON_ONCE(!loopfs_device(lo));
> > +
> > +	mutex_lock(&loop_ctl_mutex);
> > +	lo_info = lo->lo_info;
> > +	lo_inode = lo_info->lo_inode;
> > +	lo_info->lo_inode = NULL;
> > +	lo_info->lo_flags |= LOOPFS_FLAGS_INACTIVE;
> > +
> > +	if (atomic_read(&lo->lo_refcnt) > 0) {
> > +		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
> > +	} else {
> > +		lo->lo_state = Lo_rundown;
> > +		lo->lo_disk->private_data = NULL;
> > +		lo_inode->i_private = NULL;
> > +
> > +		mutex_unlock(&loop_ctl_mutex);
> > +		__loop_clr_fd(lo, false);
> > +		return;
> > +	}
> > +	mutex_unlock(&loop_ctl_mutex);
> > +}
> > +#endif /* CONFIG_BLK_DEV_LOOPFS */
> > +
> >  static int
> >  loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
> >  {
> > @@ -1842,7 +1937,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
> >  
> >  	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
> >  		if (lo->lo_state != Lo_bound)
> > -			goto out_unlock;
> > +			goto out_remove;
> >  		lo->lo_state = Lo_rundown;
> >  		mutex_unlock(&loop_ctl_mutex);
> >  		/*
> > @@ -1860,6 +1955,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
> >  		blk_mq_unfreeze_queue(lo->lo_queue);
> >  	}
> >  
> > +out_remove:
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	if (lo->lo_state != Lo_bound && loopfs_wants_remove(lo))
> > +		__loop_remove(lo);
> > +#endif
> > +
> >  out_unlock:
> >  	mutex_unlock(&loop_ctl_mutex);
> >  }
> > @@ -1878,6 +1979,11 @@ static const struct block_device_operations lo_fops = {
> >   * And now the modules code and kernel interface.
> >   */
> >  static int max_loop;
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +unsigned long max_devices;
> > +#else
> > +static unsigned long max_devices;
> > +#endif
> >  module_param(max_loop, int, 0444);
> >  MODULE_PARM_DESC(max_loop, "Maximum number of loop devices");
> >  module_param(max_part, int, 0444);
> > @@ -2006,7 +2112,7 @@ static const struct blk_mq_ops loop_mq_ops = {
> >  	.complete	= lo_complete_rq,
> >  };
> >  
> > -static int loop_add(struct loop_device **l, int i)
> > +static int loop_add(struct loop_device **l, int i, struct inode *inode)
> >  {
> >  	struct loop_device *lo;
> >  	struct gendisk *disk;
> > @@ -2096,7 +2202,17 @@ static int loop_add(struct loop_device **l, int i)
> >  	disk->private_data	= lo;
> >  	disk->queue		= lo->lo_queue;
> >  	sprintf(disk->disk_name, "loop%d", i);
> > +
> >  	add_disk(disk);
> > +
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	err = loopfs_add(lo, inode, disk_devt(disk));
> > +	if (err) {
> > +		__loop_remove(lo);
> > +		goto out;
> > +	}
> > +#endif
> > +
> >  	*l = lo;
> >  	return lo->lo_number;
> >  
> > @@ -2112,36 +2228,41 @@ static int loop_add(struct loop_device **l, int i)
> >  	return err;
> >  }
> >  
> > -static void loop_remove(struct loop_device *lo)
> > -{
> > -	del_gendisk(lo->lo_disk);
> > -	blk_cleanup_queue(lo->lo_queue);
> > -	blk_mq_free_tag_set(&lo->tag_set);
> > -	put_disk(lo->lo_disk);
> > -	kfree(lo);
> > -}
> > +struct find_free_cb_data {
> > +	struct loop_device **l;
> > +	struct inode *inode;
> > +};
> >  
> >  static int find_free_cb(int id, void *ptr, void *data)
> >  {
> >  	struct loop_device *lo = ptr;
> > -	struct loop_device **l = data;
> > +	struct find_free_cb_data *cb_data = data;
> >  
> > -	if (lo->lo_state == Lo_unbound) {
> > -		*l = lo;
> > -		return 1;
> > -	}
> > -	return 0;
> > +	if (lo->lo_state != Lo_unbound)
> > +		return 0;
> > +
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	if (!loopfs_access(cb_data->inode, lo))
> > +		return 0;
> > +#endif
> > +
> > +	*cb_data->l = lo;
> > +	return 1;
> >  }
> >  
> > -static int loop_lookup(struct loop_device **l, int i)
> > +static int loop_lookup(struct loop_device **l, int i, struct inode *inode)
> >  {
> >  	struct loop_device *lo;
> >  	int ret = -ENODEV;
> >  
> >  	if (i < 0) {
> >  		int err;
> > +		struct find_free_cb_data cb_data = {
> > +			.l = &lo,
> > +			.inode = inode,
> > +		};
> >  
> > -		err = idr_for_each(&loop_index_idr, &find_free_cb, &lo);
> > +		err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data);
> >  		if (err == 1) {
> >  			*l = lo;
> >  			ret = lo->lo_number;
> > @@ -2152,6 +2273,11 @@ static int loop_lookup(struct loop_device **l, int i)
> >  	/* lookup and return a specific i */
> >  	lo = idr_find(&loop_index_idr, i);
> >  	if (lo) {
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +		if (!loopfs_access(inode, lo))
> > +			return -EACCES;
> > +#endif
> > +
> >  		*l = lo;
> >  		ret = lo->lo_number;
> >  	}
> > @@ -2166,9 +2292,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
> >  	int err;
> >  
> >  	mutex_lock(&loop_ctl_mutex);
> > -	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
> > +	err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL);
> >  	if (err < 0)
> > -		err = loop_add(&lo, MINOR(dev) >> part_shift);
> > +		err = loop_add(&lo, MINOR(dev) >> part_shift, NULL);
> >  	if (err < 0)
> >  		kobj = NULL;
> >  	else
> > @@ -2192,15 +2318,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
> >  	ret = -ENOSYS;
> >  	switch (cmd) {
> >  	case LOOP_CTL_ADD:
> > -		ret = loop_lookup(&lo, parm);
> > +		ret = loop_lookup(&lo, parm, file_inode(file));
> >  		if (ret >= 0) {
> >  			ret = -EEXIST;
> >  			break;
> >  		}
> > -		ret = loop_add(&lo, parm);
> > +		ret = loop_add(&lo, parm, file_inode(file));
> >  		break;
> >  	case LOOP_CTL_REMOVE:
> > -		ret = loop_lookup(&lo, parm);
> > +		ret = loop_lookup(&lo, parm, file_inode(file));
> >  		if (ret < 0)
> >  			break;
> >  		if (lo->lo_state != Lo_unbound) {
> > @@ -2212,14 +2338,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
> >  			break;
> >  		}
> >  		lo->lo_disk->private_data = NULL;
> > -		idr_remove(&loop_index_idr, lo->lo_number);
> > -		loop_remove(lo);
> > +		__loop_remove(lo);
> >  		break;
> >  	case LOOP_CTL_GET_FREE:
> > -		ret = loop_lookup(&lo, -1);
> > +		ret = loop_lookup(&lo, -1, file_inode(file));
> >  		if (ret >= 0)
> >  			break;
> > -		ret = loop_add(&lo, -1);
> > +		ret = loop_add(&lo, -1, file_inode(file));
> >  	}
> >  	mutex_unlock(&loop_ctl_mutex);
> >  
> > @@ -2246,7 +2371,6 @@ MODULE_ALIAS("devname:loop-control");
> >  static int __init loop_init(void)
> >  {
> >  	int i, nr;
> > -	unsigned long range;
> >  	struct loop_device *lo;
> >  	int err;
> >  
> > @@ -2285,10 +2409,10 @@ static int __init loop_init(void)
> >  	 */
> >  	if (max_loop) {
> >  		nr = max_loop;
> > -		range = max_loop << part_shift;
> > +		max_devices = max_loop << part_shift;
> >  	} else {
> >  		nr = CONFIG_BLK_DEV_LOOP_MIN_COUNT;
> > -		range = 1UL << MINORBITS;
> > +		max_devices = 1UL << MINORBITS;
> >  	}
> >  
> >  	err = misc_register(&loop_misc);
> > @@ -2301,13 +2425,13 @@ static int __init loop_init(void)
> >  		goto misc_out;
> >  	}
> >  
> > -	blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
> > +	blk_register_region(MKDEV(LOOP_MAJOR, 0), max_devices,
> >  				  THIS_MODULE, loop_probe, NULL, NULL);
> >  
> >  	/* pre-create number of devices given by config or max_loop */
> >  	mutex_lock(&loop_ctl_mutex);
> >  	for (i = 0; i < nr; i++)
> > -		loop_add(&lo, i);
> > +		loop_add(&lo, i, NULL);
> >  	mutex_unlock(&loop_ctl_mutex);
> >  
> >  	printk(KERN_INFO "loop: module loaded\n");
> > @@ -2329,14 +2453,10 @@ static int loop_exit_cb(int id, void *ptr, void *data)
> >  
> >  static void __exit loop_exit(void)
> >  {
> > -	unsigned long range;
> > -
> > -	range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
> > -
> >  	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
> >  	idr_destroy(&loop_index_idr);
> >  
> > -	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
> > +	blk_unregister_region(MKDEV(LOOP_MAJOR, 0), max_devices);
> >  	unregister_blkdev(LOOP_MAJOR, "loop");
> >  
> >  	misc_deregister(&loop_misc);
> > diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> > index af75a5ee4094..6fed746b6124 100644
> > --- a/drivers/block/loop.h
> > +++ b/drivers/block/loop.h
> > @@ -17,6 +17,10 @@
> >  #include <linux/kthread.h>
> >  #include <uapi/linux/loop.h>
> >  
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +#include "loopfs/loopfs.h"
> > +#endif
> > +
> >  /* Possible states of device */
> >  enum {
> >  	Lo_unbound,
> > @@ -62,6 +66,9 @@ struct loop_device {
> >  	struct request_queue	*lo_queue;
> >  	struct blk_mq_tag_set	tag_set;
> >  	struct gendisk		*lo_disk;
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	struct lo_loopfs	*lo_info;
> > +#endif
> >  };
> >  
> >  struct loop_cmd {
> > @@ -89,6 +96,9 @@ struct loop_func_table {
> >  }; 
> >  
> >  int loop_register_transfer(struct loop_func_table *funcs);
> > -int loop_unregister_transfer(int number); 
> > +int loop_unregister_transfer(int number);
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +extern unsigned long max_devices;
> > +#endif
> >  
> >  #endif
> > diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile
> > new file mode 100644
> > index 000000000000..87ec703b662e
> > --- /dev/null
> > +++ b/drivers/block/loopfs/Makefile
> > @@ -0,0 +1,3 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +loopfs-y			:= loopfs.o
> > +obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs.o
> > diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
> > new file mode 100644
> > index 000000000000..b3461c72b6e7
> > --- /dev/null
> > +++ b/drivers/block/loopfs/loopfs.c
> > @@ -0,0 +1,494 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/fs_parser.h>
> > +#include <linux/fsnotify.h>
> > +#include <linux/genhd.h>
> > +#include <linux/init.h>
> > +#include <linux/list.h>
> > +#include <linux/magic.h>
> > +#include <linux/major.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/module.h>
> > +#include <linux/mount.h>
> > +#include <linux/namei.h>
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/seq_file.h>
> > +
> > +#include "../loop.h"
> > +#include "loopfs.h"
> > +
> > +#define FIRST_INODE 1
> > +#define SECOND_INODE 2
> > +#define INODE_OFFSET 3
> > +
> > +enum loopfs_param {
> > +	Opt_max,
> > +};
> > +
> > +const struct fs_parameter_spec loopfs_fs_parameters[] = {
> > +	fsparam_u32("max",	Opt_max),
> > +	{}
> > +};
> > +
> > +struct loopfs_mount_opts {
> > +	int max;
> > +};
> > +
> > +struct loopfs_info {
> > +	kuid_t root_uid;
> > +	kgid_t root_gid;
> > +	unsigned long device_count;
> > +	struct dentry *control_dentry;
> > +	struct loopfs_mount_opts mount_opts;
> > +};
> > +
> > +static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb)
> > +{
> > +	return sb->s_fs_info;
> > +}
> > +
> > +struct super_block *loopfs_i_sb(const struct inode *inode)
> > +{
> > +	if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC)
> > +		return inode->i_sb;
> > +
> > +	return NULL;
> > +}
> > +
> > +bool loopfs_device(const struct loop_device *lo)
> > +{
> > +	return lo->lo_info != NULL;
> > +}
> > +
> > +struct user_namespace *loopfs_ns(const struct loop_device *lo)
> > +{
> > +	if (loopfs_device(lo)) {
> > +		struct super_block *sb;
> > +
> > +		sb = loopfs_i_sb(lo->lo_info->lo_inode);
> > +		if (sb)
> > +			return sb->s_user_ns;
> > +	}
> > +
> > +	return &init_user_ns;
> > +}
> > +
> > +bool loopfs_access(const struct inode *first, struct loop_device *lo)
> > +{
> > +	return loopfs_device(lo) &&
> > +	       loopfs_i_sb(first) == loopfs_i_sb(lo->lo_info->lo_inode);
> > +}
> > +
> > +bool loopfs_wants_remove(const struct loop_device *lo)
> > +{
> > +	return lo->lo_info && (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE);
> > +}
> > +
> > +/**
> > + * loopfs_add - allocate inode from super block of a loopfs mount
> > + * @lo:		loop device for which we are creating a new device entry
> > + * @ref_inode:	inode from which the super block will be taken
> > + * @device_nr:  device number of the associated disk device
> > + *
> > + * This function creates a new device node for @lo.
> > + * Minor numbers are limited and tracked globally. The
> > + * function will stash a struct loop_device for the specific loop
> > + * device in i_private of the inode.
> > + * It will go on to allocate a new inode from the super block of the
> > + * filesystem mount, stash a struct loop_device in its i_private field
> > + * and attach a dentry to that inode.
> > + *
> > + * Return: 0 on success, negative errno on failure
> > + */
> > +int loopfs_add(struct loop_device *lo, struct inode *ref_inode, dev_t device_nr)
> > +{
> > +	int ret;
> > +	char name[DISK_NAME_LEN];
> > +	struct super_block *sb;
> > +	struct loopfs_info *info;
> > +	struct dentry *root, *dentry;
> > +	struct inode *inode;
> > +	struct lo_loopfs *lo_info;
> > +
> > +	sb = loopfs_i_sb(ref_inode);
> > +	if (!sb)
> > +		return 0;
> > +
> > +	if (MAJOR(device_nr) != LOOP_MAJOR)
> > +		return -EINVAL;
> > +
> > +	lo_info = kzalloc(sizeof(struct lo_loopfs), GFP_KERNEL);
> > +	if (!lo_info) {
> > +		ret = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	info = LOOPFS_SB(sb);
> > +	if ((info->device_count + 1) > info->mount_opts.max) {
> > +		ret = -ENOSPC;
> > +		goto err;
> > +	}
> > +
> > +	lo_info->lo_ucount = inc_ucount(sb->s_user_ns,
> > +					info->root_uid, UCOUNT_LOOP_DEVICES);
> > +	if (!lo_info->lo_ucount) {
> > +		ret = -ENOSPC;
> > +		goto err;
> > +	}
> > +
> > +	if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) {
> > +		ret = -EINVAL;
> > +		goto err;
> > +	}
> > +
> > +	inode = new_inode(sb);
> > +	if (!inode) {
> > +		ret = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	/*
> > +	 * The i_fop field will be set to the correct fops by the device layer
> > +	 * when the loop device in this loopfs instance is opened.
> > +	 */
> > +	inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
> > +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> > +	inode->i_uid = info->root_uid;
> > +	inode->i_gid = info->root_gid;
> > +	init_special_inode(inode, S_IFBLK | 0600, device_nr);
> > +
> > +	root = sb->s_root;
> > +	inode_lock(d_inode(root));
> > +	/* look it up */
> > +	dentry = lookup_one_len(name, root, strlen(name));
> > +	if (IS_ERR(dentry)) {
> > +		inode_unlock(d_inode(root));
> > +		iput(inode);
> > +		ret = PTR_ERR(dentry);
> > +		goto err;
> > +	}
> > +
> > +	if (d_really_is_positive(dentry)) {
> > +		/* already exists */
> > +		dput(dentry);
> > +		inode_unlock(d_inode(root));
> > +		iput(inode);
> > +		ret = -EEXIST;
> > +		goto err;
> > +	}
> > +
> > +	d_instantiate(dentry, inode);
> > +	fsnotify_create(d_inode(root), dentry);
> > +	inode_unlock(d_inode(root));
> > +
> > +	lo_info->lo_inode = inode;
> > +	lo->lo_info = lo_info;
> > +	inode->i_private = lo;
> > +	info->device_count++;
> > +
> > +	return 0;
> > +
> > +err:
> > +	if (lo_info->lo_ucount)
> > +		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
> > +	kfree(lo_info);
> > +	return ret;
> > +}
> > +
> > +void loopfs_remove(struct loop_device *lo)
> > +{
> > +	struct lo_loopfs *lo_info = lo->lo_info;
> > +	struct inode *inode;
> > +	struct super_block *sb;
> > +	struct dentry *root, *dentry;
> > +
> > +	if (!lo_info)
> > +		return;
> > +
> > +	inode = lo_info->lo_inode;
> > +	if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR)
> > +		goto out;
> > +
> > +	sb = loopfs_i_sb(inode);
> > +	lo_info->lo_inode = NULL;
> > +
> > +	/*
> > +	 * The root dentry is always the parent dentry since we don't allow
> > +	 * creation of directories.
> > +	 */
> > +	root = sb->s_root;
> > +
> > +	inode_lock(d_inode(root));
> > +	dentry = d_find_any_alias(inode);
> > +	if (dentry && simple_positive(dentry)) {
> > +		simple_unlink(d_inode(root), dentry);
> > +		d_delete(dentry);
> > +	}
> > +	dput(dentry);
> > +	inode_unlock(d_inode(root));
> > +	LOOPFS_SB(sb)->device_count--;
> > +
> > +out:
> > +	if (lo_info->lo_ucount)
> > +		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
> > +	kfree(lo->lo_info);
> > +	lo->lo_info = NULL;
> > +}
> > +
> > +static void loopfs_fs_context_free(struct fs_context *fc)
> > +{
> > +	struct loopfs_mount_opts *ctx = fc->fs_private;
> > +
> > +	kfree(ctx);
> > +}
> > +
> > +/**
> > + * loopfs_loop_ctl_create - create a new loop-control device
> > + * @sb: super block of the loopfs mount
> > + *
> > + * This function creates a new loop-control device node in the loopfs mount
> > + * referred to by @sb.
> > + *
> > + * Return: 0 on success, negative errno on failure
> > + */
> > +static int loopfs_loop_ctl_create(struct super_block *sb)
> > +{
> > +	struct dentry *dentry;
> > +	struct inode *inode = NULL;
> > +	struct dentry *root = sb->s_root;
> > +	struct loopfs_info *info = sb->s_fs_info;
> > +
> > +	if (info->control_dentry)
> > +		return 0;
> > +
> > +	inode = new_inode(sb);
> > +	if (!inode)
> > +		return -ENOMEM;
> > +
> > +	inode->i_ino = SECOND_INODE;
> > +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> > +	init_special_inode(inode, S_IFCHR | 0600,
> > +			   MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR));
> > +	/*
> > +	 * The i_fop field will be set to the correct fops by the device layer
> > +	 * when the loop-control device in this loopfs instance is opened.
> > +	 */
> > +	inode->i_uid = info->root_uid;
> > +	inode->i_gid = info->root_gid;
> > +
> > +	dentry = d_alloc_name(root, "loop-control");
> > +	if (!dentry) {
> > +		iput(inode);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	info->control_dentry = dentry;
> > +	d_add(dentry, inode);
> > +
> > +	return 0;
> > +}
> > +
> > +static inline bool is_loopfs_control_device(const struct dentry *dentry)
> > +{
> > +	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
> > +}
> > +
> > +static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> > +			 struct inode *new_dir, struct dentry *new_dentry,
> > +			 unsigned int flags)
> > +{
> > +	if (is_loopfs_control_device(old_dentry) ||
> > +	    is_loopfs_control_device(new_dentry))
> > +		return -EPERM;
> > +
> > +	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
> > +}
> > +
> > +static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
> > +{
> > +	int ret;
> > +	struct loop_device *lo;
> > +
> > +	if (is_loopfs_control_device(dentry))
> > +		return -EPERM;
> > +
> > +	lo = d_inode(dentry)->i_private;
> > +	ret = loopfs_rundown_locked(lo);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return simple_unlink(dir, dentry);
> > +}
> > +
> > +static const struct inode_operations loopfs_dir_inode_operations = {
> > +	.lookup = simple_lookup,
> > +	.rename = loopfs_rename,
> > +	.unlink = loopfs_unlink,
> > +};
> > +
> > +static void loopfs_evict_inode(struct inode *inode)
> > +{
> > +	struct loop_device *lo = inode->i_private;
> > +
> > +	clear_inode(inode);
> > +
> > +	if (lo && S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
> > +		loopfs_evict_locked(lo);
> > +		LOOPFS_SB(inode->i_sb)->device_count--;
> > +		inode->i_private = NULL;
> > +	}
> > +}
> > +
> > +static int loopfs_show_options(struct seq_file *seq, struct dentry *root)
> > +{
> > +	struct loopfs_info *info = LOOPFS_SB(root->d_sb);
> > +
> > +	if (info->mount_opts.max <= max_devices)
> > +		seq_printf(seq, ",max=%d", info->mount_opts.max);
> > +
> > +	return 0;
> > +}
> > +
> > +static void loopfs_put_super(struct super_block *sb)
> > +{
> > +	struct loopfs_info *info = sb->s_fs_info;
> > +
> > +	sb->s_fs_info = NULL;
> > +	kfree(info);
> > +}
> > +
> > +static const struct super_operations loopfs_super_ops = {
> > +	.evict_inode    = loopfs_evict_inode,
> > +	.show_options	= loopfs_show_options,
> > +	.statfs         = simple_statfs,
> > +	.put_super	= loopfs_put_super,
> > +};
> > +
> > +static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
> > +{
> > +	struct loopfs_info *info;
> > +	struct loopfs_mount_opts *ctx = fc->fs_private;
> > +	struct inode *inode = NULL;
> > +
> > +	sb->s_blocksize = PAGE_SIZE;
> > +	sb->s_blocksize_bits = PAGE_SHIFT;
> > +
> > +	sb->s_iflags &= ~SB_I_NODEV;
> > +	sb->s_iflags |= SB_I_NOEXEC;
> > +	sb->s_magic = LOOPFS_SUPER_MAGIC;
> > +	sb->s_op = &loopfs_super_ops;
> > +	sb->s_time_gran = 1;
> > +
> > +	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
> > +	if (!sb->s_fs_info)
> > +		return -ENOMEM;
> > +	info = sb->s_fs_info;
> > +
> > +	info->root_gid = make_kgid(sb->s_user_ns, 0);
> > +	if (!gid_valid(info->root_gid))
> > +		info->root_gid = GLOBAL_ROOT_GID;
> > +	info->root_uid = make_kuid(sb->s_user_ns, 0);
> > +	if (!uid_valid(info->root_uid))
> > +		info->root_uid = GLOBAL_ROOT_UID;
> > +	info->mount_opts.max = ctx->max;
> > +
> > +	inode = new_inode(sb);
> > +	if (!inode)
> > +		return -ENOMEM;
> > +
> > +	inode->i_ino = FIRST_INODE;
> > +	inode->i_fop = &simple_dir_operations;
> > +	inode->i_mode = S_IFDIR | 0755;
> > +	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> > +	inode->i_op = &loopfs_dir_inode_operations;
> > +	set_nlink(inode, 2);
> > +
> > +	sb->s_root = d_make_root(inode);
> > +	if (!sb->s_root)
> > +		return -ENOMEM;
> > +
> > +	return loopfs_loop_ctl_create(sb);
> > +}
> > +
> > +static int loopfs_fs_context_get_tree(struct fs_context *fc)
> > +{
> > +	return get_tree_nodev(fc, loopfs_fill_super);
> > +}
> > +
> > +static int loopfs_fs_context_parse_param(struct fs_context *fc,
> > +					 struct fs_parameter *param)
> > +{
> > +	int opt;
> > +	struct loopfs_mount_opts *ctx = fc->fs_private;
> > +	struct fs_parse_result result;
> > +
> > +	opt = fs_parse(fc, loopfs_fs_parameters, param, &result);
> > +	if (opt < 0)
> > +		return opt;
> > +
> > +	switch (opt) {
> > +	case Opt_max:
> > +		if (result.uint_32 > max_devices)
> > +			return invalfc(fc, "Bad value for '%s'", param->key);
> > +
> > +		ctx->max = result.uint_32;
> > +		break;
> > +	default:
> > +		return invalfc(fc, "Unsupported parameter '%s'", param->key);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int loopfs_fs_context_reconfigure(struct fs_context *fc)
> > +{
> > +	struct loopfs_mount_opts *ctx = fc->fs_private;
> > +	struct loopfs_info *info = LOOPFS_SB(fc->root->d_sb);
> > +
> > +	info->mount_opts.max = ctx->max;
> > +	return 0;
> > +}
> > +
> > +static const struct fs_context_operations loopfs_fs_context_ops = {
> > +	.free		= loopfs_fs_context_free,
> > +	.get_tree	= loopfs_fs_context_get_tree,
> > +	.parse_param	= loopfs_fs_context_parse_param,
> > +	.reconfigure	= loopfs_fs_context_reconfigure,
> > +};
> > +
> > +static int loopfs_init_fs_context(struct fs_context *fc)
> > +{
> > +	struct loopfs_mount_opts *ctx = fc->fs_private;
> > +
> > +	ctx = kzalloc(sizeof(struct loopfs_mount_opts), GFP_KERNEL);
> > +	if (!ctx)
> > +		return -ENOMEM;
> > +
> > +	ctx->max = max_devices;
> > +
> > +	fc->fs_private = ctx;
> > +
> > +	fc->ops = &loopfs_fs_context_ops;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct file_system_type loop_fs_type = {
> > +	.name			= "loop",
> > +	.init_fs_context	= loopfs_init_fs_context,
> > +	.parameters		= loopfs_fs_parameters,
> > +	.kill_sb		= kill_litter_super,
> > +	.fs_flags		= FS_USERNS_MOUNT,
> > +};
> > +
> > +int __init init_loopfs(void)
> > +{
> > +	init_user_ns.ucount_max[UCOUNT_LOOP_DEVICES] = 255;
> > +	return register_filesystem(&loop_fs_type);
> > +}
> > +
> > +module_init(init_loopfs);
> > +MODULE_AUTHOR("Christian Brauner <christian.brauner@ubuntu.com>");
> > +MODULE_DESCRIPTION("Loop device filesystem");
> > diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
> > new file mode 100644
> > index 000000000000..2ee114aa3fa9
> > --- /dev/null
> > +++ b/drivers/block/loopfs/loopfs.h
> > @@ -0,0 +1,36 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _LINUX_LOOPFS_FS_H
> > +#define _LINUX_LOOPFS_FS_H
> > +
> > +#include <linux/errno.h>
> > +#include <linux/fs.h>
> > +#include <linux/magic.h>
> > +#include <linux/user_namespace.h>
> > +
> > +struct loop_device;
> > +
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +
> > +#define LOOPFS_FLAGS_INACTIVE (1 << 0)
> > +
> > +struct lo_loopfs {
> > +	struct ucounts *lo_ucount;
> > +	struct inode *lo_inode;
> > +	int lo_flags;
> > +};
> > +
> > +extern struct super_block *loopfs_i_sb(const struct inode *inode);
> > +extern bool loopfs_device(const struct loop_device *lo);
> > +extern struct user_namespace *loopfs_ns(const struct loop_device *lo);
> > +extern bool loopfs_access(const struct inode *first, struct loop_device *lo);
> > +extern int loopfs_add(struct loop_device *lo, struct inode *ref_inode,
> > +		      dev_t device_nr);
> > +extern void loopfs_remove(struct loop_device *lo);
> > +extern bool loopfs_wants_remove(const struct loop_device *lo);
> > +extern void loopfs_evict_locked(struct loop_device *lo);
> > +extern int loopfs_rundown_locked(struct loop_device *lo);
> > +
> > +#endif
> > +
> > +#endif /* _LINUX_LOOPFS_FS_H */
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index 6ef1c7109fc4..04a4891765c0 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -49,6 +49,9 @@ enum ucount_type {
> >  #ifdef CONFIG_INOTIFY_USER
> >  	UCOUNT_INOTIFY_INSTANCES,
> >  	UCOUNT_INOTIFY_WATCHES,
> > +#endif
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	UCOUNT_LOOP_DEVICES,
> >  #endif
> >  	UCOUNT_COUNTS,
> >  };
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index d78064007b17..0817d093a012 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -75,6 +75,7 @@
> >  #define BINFMTFS_MAGIC          0x42494e4d
> >  #define DEVPTS_SUPER_MAGIC	0x1cd1
> >  #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
> > +#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
> >  #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
> >  #define PIPEFS_MAGIC            0x50495045
> >  #define PROC_SUPER_MAGIC	0x9fa0
> > diff --git a/kernel/ucount.c b/kernel/ucount.c
> > index 11b1596e2542..fb0f6394a8bb 100644
> > --- a/kernel/ucount.c
> > +++ b/kernel/ucount.c
> > @@ -73,6 +73,9 @@ static struct ctl_table user_table[] = {
> >  #ifdef CONFIG_INOTIFY_USER
> >  	UCOUNT_ENTRY("max_inotify_instances"),
> >  	UCOUNT_ENTRY("max_inotify_watches"),
> > +#endif
> > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > +	UCOUNT_ENTRY("max_loop_devices"),
> >  #endif
> >  	{ }
> >  };
> > -- 
> > 2.26.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] loop: preserve sysfs backwards compatibility
  2020-04-23 11:15     ` Christian Brauner
@ 2020-04-23 16:13       ` Serge E. Hallyn
  0 siblings, 0 replies; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-23 16:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Serge E. Hallyn, Jens Axboe, Greg Kroah-Hartman, linux-kernel,
	linux-block, linux-api, Jonathan Corbet, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Thu, Apr 23, 2020 at 01:15:24PM +0200, Christian Brauner wrote:
> On Wed, Apr 22, 2020 at 08:17:06PM -0500, Serge Hallyn wrote:
> > On Wed, Apr 22, 2020 at 04:54:35PM +0200, Christian Brauner wrote:
> > > For sysfs the initial namespace is special. All devices currently
> > > propagate into all non-initial namespaces. For example, sysfs is usually
> > > mounted in a privileged or unprivileged container and all devices are
> > > visible to the container but are owned by global root. Even though none
> > > of the propagated files can be used there are still a lot of read-only
> > > values that are accessed or read by tools running in non-initial
> > > namespaces. Some devices though, which can be moved or created in
> > > another namespace, will only show up in the corresponding namespace.
> > > This currently includes network and loop devices but no other ones.
> > > > Since all current workloads depend on devices from the initial namespace
> > > > being visible this behavior cannot be simply changed. This patch just
> > > > makes sure to keep propagating devices that share the same device class
> > > > with loop devices from the initial namespace into all non-initial
> > > > namespaces as before. In short, nothing changes except that loopfs loop
> > > > devices will be shown in their correct namespace.
> > > 
> > > Cc: Jens Axboe <axboe@kernel.dk>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > 
> > Hi,
> > 
> > two comments below:
> > 
> > > ---
> > > /* v2 */
> > > - Christian Brauner <christian.brauner@ubuntu.com>:
> > >   - Protect init_net with a CONFIG_NET ifdef in case it is set to "n".
> > > >   - As Tejun pointed out there is an argument to be made that a new mount
> > >     option for sysfs could be added that would change how devices are
> > >     propagated. This patch does not prevent this but it is an orthogonal
> > >     problem.
> > > ---
> > >  block/genhd.c               | 79 +++++++++++++++++++++++++++++++++++++
> > >  fs/kernfs/dir.c             | 34 +++++++++++++---
> > >  fs/kernfs/kernfs-internal.h | 24 +++++++++++
> > >  fs/sysfs/mount.c            |  4 ++
> > >  include/linux/genhd.h       |  3 ++
> > >  include/linux/kernfs.h      | 22 +++++++++++
> > >  include/linux/kobject_ns.h  |  4 ++
> > >  lib/kobject.c               |  2 +
> > >  8 files changed, 167 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index 06b642b23a07..b5b2601c4311 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -1198,11 +1198,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
> > >  	return NULL;
> > >  }
> > >  
> > > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > > +static void *user_grab_current_ns(void)
> > > +{
> > > +	struct user_namespace *ns = current_user_ns();
> > > +	return get_user_ns(ns);
> > > +}
> > > +
> > > +static const void *user_initial_ns(void)
> > > +{
> > > +	return &init_user_ns;
> > > +}
> > > +
> > > +static void user_put_ns(void *p)
> > > +{
> > > +	struct user_namespace *ns = p;
> > > +	put_user_ns(ns);
> > > +}
> > > +
> > > +static bool user_current_may_mount(void)
> > > +{
> > > +	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
> > > +}
> > > +
> > > +const struct kobj_ns_type_operations user_ns_type_operations = {
> > > +	.type			= KOBJ_NS_TYPE_USER,
> > > +	.current_may_mount	= user_current_may_mount,
> > > +	.grab_current_ns	= user_grab_current_ns,
> > > +	.initial_ns		= user_initial_ns,
> > > +	.drop_ns		= user_put_ns,
> > > +};
> > > +
> > > +static const void *block_class_user_namespace(struct device *dev)
> > > +{
> > > +	struct gendisk *disk;
> > > +
> > > +	if (dev->type == &part_type)
> > > +		disk = part_to_disk(dev_to_part(dev));
> > > +	else
> > > +		disk = dev_to_disk(dev);
> > > +
> > > +	return disk->user_ns;
> > > +}
> > > +
> > > +static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
> > > +{
> > > +	struct gendisk *disk;
> > > +	struct user_namespace *ns;
> > > +
> > > +	if (dev->type == &part_type)
> > > +		disk = part_to_disk(dev_to_part(dev));
> > > +	else
> > > +		disk = dev_to_disk(dev);
> > > +
> > > +	ns = disk->user_ns;
> > > +	if (ns && ns != &init_user_ns) {
> > > +		kuid_t ns_root_uid = make_kuid(ns, 0);
> > > +		kgid_t ns_root_gid = make_kgid(ns, 0);
> > > +
> > > +		if (uid_valid(ns_root_uid))
> > > +			*uid = ns_root_uid;
> > > +
> > > +		if (gid_valid(ns_root_gid))
> > > +			*gid = ns_root_gid;
> > > +	}
> > 
> > You're not setting uid and gid in the else case?
> 
> Right, the reason being that sysfs and the associated kobject
> infrastructure will always set global root as the default. So the

Oh, ok, I had thought that would be the case but failed to find
it yesterday :)  thx

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> callchain is:
> kobject_get_ownership()
> and this calls the ktype callbacks which hits
> -> device_get_ownership()
> which calls into the device class specific callbacks, which in this
> case calls block_class_get_ownership().
> 
> And there are no direct callers of, say, <device-class>->get_ownership();
> that all always needs to go through the callback infrastructure.
> 
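
To make the callchain above concrete, here is a minimal userspace sketch
(not kernel code) of the default-then-override pattern it describes. All
model_* names and the plain unsigned int ids are hypothetical stand-ins
for kuid_t/kgid_t, struct gendisk and the real callbacks. The only
behaviour taken from the thread is that the generic entry point sets
global root first, and that the class callback overrides it only for a
non-initial namespace with a valid mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GLOBAL_ROOT_UID/GID and an invalid kuid/kgid. */
#define MODEL_GLOBAL_ROOT 0u
#define MODEL_INVALID_ID  ((unsigned int)-1)

/* Simplified user namespace: only what the ownership decision needs. */
struct model_user_ns {
	bool is_init;          /* true for the initial user namespace */
	unsigned int root_uid; /* host uid that ns-root maps to, or invalid */
	unsigned int root_gid; /* host gid that ns-root maps to, or invalid */
};

struct model_disk {
	const struct model_user_ns *user_ns;
};

/*
 * Models the class callback: it has no else branch and only overrides
 * the ids for a non-initial namespace whose root id maps to a valid id.
 */
static void model_block_class_get_ownership(const struct model_disk *disk,
					    unsigned int *uid,
					    unsigned int *gid)
{
	const struct model_user_ns *ns = disk->user_ns;

	if (ns && !ns->is_init) {
		if (ns->root_uid != MODEL_INVALID_ID)
			*uid = ns->root_uid;
		if (ns->root_gid != MODEL_INVALID_ID)
			*gid = ns->root_gid;
	}
}

/*
 * Models the generic entry point: global root is always set first, so
 * the callback above never has to handle the default case itself.
 */
static void model_kobject_get_ownership(const struct model_disk *disk,
					unsigned int *uid,
					unsigned int *gid)
{
	assert(disk && uid && gid);

	*uid = MODEL_GLOBAL_ROOT;
	*gid = MODEL_GLOBAL_ROOT;
	model_block_class_get_ownership(disk, uid, gid);
}
```

In this model a disk in the initial namespace, or one whose namespace
has no valid root mapping, keeps global root ownership because nothing
can reach the class callback without passing through the entry point.
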
> > 
> > > +}
> > > +#endif /* CONFIG_BLK_DEV_LOOPFS */
> > > +
> > >  static int __init genhd_device_init(void)
> > >  {
> > >  	int error;
> > >  
> > >  	block_class.dev_kobj = sysfs_dev_block_kobj;
> > > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > > +	kobj_ns_type_register(&user_ns_type_operations);
> > > +#endif
> > >  	error = class_register(&block_class);
> > >  	if (unlikely(error))
> > >  		return error;
> > > @@ -1524,8 +1594,14 @@ static void disk_release(struct device *dev)
> > >  		blk_put_queue(disk->queue);
> > >  	kfree(disk);
> > >  }
> > > +
> > >  struct class block_class = {
> > >  	.name		= "block",
> > > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > > +	.ns_type	= &user_ns_type_operations,
> > > +	.namespace	= block_class_user_namespace,
> > > +	.get_ownership	= block_class_get_ownership,
> > > +#endif
> > >  };
> > >  
> > >  static char *block_devnode(struct device *dev, umode_t *mode,
> > > @@ -1715,6 +1791,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
> > >  		disk_to_dev(disk)->class = &block_class;
> > >  		disk_to_dev(disk)->type = &disk_type;
> > >  		device_initialize(disk_to_dev(disk));
> > > +#ifdef CONFIG_BLK_DEV_LOOPFS
> > > +		disk->user_ns = &init_user_ns;
> > > +#endif
> > >  	}
> > >  	return disk;
> > >  }
> > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > index 1f2d894ae454..02796ba6521a 100644
> > > --- a/fs/kernfs/dir.c
> > > +++ b/fs/kernfs/dir.c
> > > @@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> > >  		goto out_bad;
> > >  
> > >  	/* The kernfs node has been moved to a different namespace */
> > > -	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
> > > -	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
> > > -		goto out_bad;
> > > +	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
> > > +		if (kernfs_init_ns_propagates(kn->parent) &&
> > > +		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
> > > +			goto out_good;
> > > +		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
> > > +			goto out_bad;
> > > +	}
> > >  
> > > +out_good:
> > >  	mutex_unlock(&kernfs_mutex);
> > >  	return 1;
> > >  out_bad:
> > > @@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
> > >  		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
> > >  
> > >  	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > +	if (!kn && kernfs_init_ns_propagates(parent)) {
> > > +		ns = kernfs_init_ns(parent->ns_type);
> > > +		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > +	}
> > >  
> > >  	/* no such entry */
> > >  	if (!kn || !kernfs_active(kn)) {
> > > @@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
> > >  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> > >  	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> > >  {
> > > +	const void *init_ns;
> > > +
> > >  	if (pos) {
> > >  		int valid = kernfs_active(pos) &&
> > >  			pos->parent == parent && hash == pos->hash;
> > > @@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
> > >  		if (!valid)
> > >  			pos = NULL;
> > >  	}
> > > +
> > > +	if (kernfs_init_ns_propagates(parent))
> > > +		init_ns = kernfs_init_ns(parent->ns_type);
> > > +	else
> > > +		init_ns = NULL;
> > > +
> > >  	if (!pos && (hash > 1) && (hash < INT_MAX)) {
> > >  		struct rb_node *node = parent->dir.children.rb_node;
> > >  		while (node) {
> > > @@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
> > >  		}
> > >  	}
> > >  	/* Skip over entries which are dying/dead or in the wrong namespace */
> > > -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> > > +	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
> > >  		struct rb_node *node = rb_next(&pos->rb);
> > >  		if (!node)
> > >  			pos = NULL;
> > > @@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > >  {
> > >  	pos = kernfs_dir_pos(ns, parent, ino, pos);
> > >  	if (pos) {
> > > +		const void *init_ns;
> > > +		if (kernfs_init_ns_propagates(parent))
> > > +			init_ns = kernfs_init_ns(parent->ns_type);
> > > +		else
> > > +			init_ns = NULL;
> > > +
> > >  		do {
> > >  			struct rb_node *node = rb_next(&pos->rb);
> > >  			if (!node)
> > >  				pos = NULL;
> > >  			else
> > >  				pos = rb_to_kn(node);
> > > -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
> > > +		} while (pos && (!kernfs_active(pos) ||
> > > +				 (pos->ns != ns && pos->ns != init_ns)));
> > >  	}
> > >  	return pos;
> > >  }
> > > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> > > index 7c972c00f84a..74eb6c447361 100644
> > > --- a/fs/kernfs/kernfs-internal.h
> > > +++ b/fs/kernfs/kernfs-internal.h
> > > @@ -80,6 +80,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
> > >  	return d_inode(dentry)->i_private;
> > >  }
> > >  
> > > +#ifdef CONFIG_NET
> > > +extern struct net init_net;
> > > +#endif
> > > +
> > > +extern struct user_namespace init_user_ns;
> > > +
> > > +static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
> > > +{
> > > +	switch (ns_type) {
> > > +	case KOBJ_NS_TYPE_NET:
> > > +#ifdef CONFIG_NET
> > > +		return &init_net;
> > > +#else
> > > +		break;
> > > +#endif
> > > +	case KOBJ_NS_TYPE_USER:
> > > +		return &init_user_ns;
> > > +	default:
> > > +		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
> > > +	}
> > > +
> > > +	return NULL;
> > > +}
> > > +
> > >  extern const struct super_operations kernfs_sops;
> > >  extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
> > >  
> > > diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> > > index 5e2ec88a709e..99b82a0ae7ea 100644
> > > --- a/fs/sysfs/mount.c
> > > +++ b/fs/sysfs/mount.c
> > > @@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
> > >  
> > >  	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
> > >  		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
> > > +	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
> > > +		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
> > >  	kernfs_free_fs_context(fc);
> > >  	kfree(kfc);
> > >  }
> > > @@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
> > >  		return -ENOMEM;
> > >  
> > >  	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
> > > +	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
> > 
> > It's nice and tidy this way so maybe worth it, but getting
> > the kobj_ns_type_lock spinlock twice in a row here seems
> > unfortunate.
> 
> Let me see if I can do something non-ugly and moderately simple about
> this. If not, it's probably fine as it is since it only happens on sysfs
> mount.
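
The propagation rule from the quoted commit message, and the changed
loop condition in kernfs_dir_pos(), can be sketched as a small userspace
predicate. This is only an illustration under stated assumptions:
model_entry, model_entry_visible() and model_count_visible() are
hypothetical names, namespaces are reduced to opaque tags, and an
explicit boolean is used where the patch sets init_ns to NULL when the
parent does not propagate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified directory entry: a name plus a namespace tag. */
struct model_entry {
	const char *name;
	const void *ns;  /* namespace tag the entry belongs to */
	bool active;     /* false models a dying/dead node */
};

/*
 * An entry is listed if it is active and either belongs to the
 * namespace of the mount, or belongs to the initial namespace while
 * the parent directory propagates initial-namespace entries.
 */
static bool model_entry_visible(const struct model_entry *e,
				const void *mount_ns,
				const void *init_ns,
				bool init_ns_propagates)
{
	assert(e != NULL);

	if (!e->active)
		return false;
	if (e->ns == mount_ns)
		return true;
	return init_ns_propagates && e->ns == init_ns;
}

/* Counts the entries a readdir in the given mount would show. */
static size_t model_count_visible(const struct model_entry *entries,
				  size_t n, const void *mount_ns,
				  const void *init_ns,
				  bool init_ns_propagates)
{
	size_t i, visible = 0;

	for (i = 0; i < n; i++)
		if (model_entry_visible(&entries[i], mount_ns, init_ns,
					init_ns_propagates))
			visible++;
	return visible;
}
```

With propagation enabled, a mount in a non-initial namespace sees its
own entries plus those of the initial namespace, but never the entries
of a sibling namespace.
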

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/7] loopfs: implement loopfs
  2020-04-23 11:24     ` Christian Brauner
@ 2020-04-23 16:17       ` Serge E. Hallyn
  2020-04-23 16:29         ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Serge E. Hallyn @ 2020-04-23 16:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Serge E. Hallyn, Jens Axboe, Greg Kroah-Hartman, linux-kernel,
	linux-block, linux-api, Jonathan Corbet, Rafael J. Wysocki,
	Tejun Heo, David S. Miller, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

On Thu, Apr 23, 2020 at 01:24:01PM +0200, Christian Brauner wrote:
> On Wed, Apr 22, 2020 at 04:52:13PM -0500, Serge Hallyn wrote:
> > On Wed, Apr 22, 2020 at 04:54:32PM +0200, Christian Brauner wrote:
> > > This implements loopfs, a loop device filesystem. It takes inspiration
> > > from the binderfs filesystem I implemented about two years ago and with
> > > which we had overall good experiences so far. Parts of it are also
> > > based on [3] but it's mostly a new, imho cleaner approach.
> > > 
> > > > Loopfs allows creating private loop device instances for applications
> > > for various use-cases. It covers the use-case that was expressed on-list
> > > and in-person to get programmatic access to private loop devices for
> > > image building in sandboxes. An illustration for this is provided in
> > > [4].
> > > 
> > > Also loopfs is intended to provide loop devices to privileged and
> > > unprivileged containers which has been a frequent request from various
> > > major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
> > > providing a non-exhaustive list of issues and requests (cf. [5]) around
> > > this feature mainly to illustrate that I'm not making the use-cases up.
> > > Currently none of this can be done safely since handing a loop device
> > > from the host into a container means that the container can see anything
> > > that the host is doing with that loop device and what other containers
> > > are doing with that device too. And (bind-)mounting devtmpfs inside of
> > > containers is not secure at all so also not an option (though sometimes
> > > done out of despair apparently).
> > > 
> > > The workloads people run in containers are supposed to be indiscernible
> > > from workloads run on the host and the tools inside of the container are
> > > supposed to not be required to be aware that they are running inside a
> > > container apart from containerization tools themselves. This is
> > > especially true when running older distros in containers that did exist
> > > > before containers were as ubiquitous as they are today. With loopfs, users
> > > > can call mount -o loop and in a correctly set up container things work
> > > > the same way they would on the host. The filesystem representation
> > > > allows us to do this in a very simple way. At container setup, a
> > > > container manager can mount a private instance of loopfs somewhere, e.g.
> > > > at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
> > > > to /dev/loop-control, pre-allocate and symlink the number of standard
> > > devices into their standard location and have a service file or rules in
> > > place that symlink additionally allocated loop devices through losetup
> > > into place as well.
> > > With the new syscall interception logic this is also possible for
> > > unprivileged containers. In these cases when a user calls mount -o loop
> > > > <image> <mountpoint> it will be possible to completely set up the loop
> > > device in the container. The final mount syscall is handled through
> > > syscall interception which we already implemented and released in
> > > earlier kernels (see [1] and [2]) and is actively used in production
> > > workloads. The mount is often rewritten to a fuse binary to provide safe
> > > access for unprivileged containers.
> > > 
> > > Loopfs also allows the creation of hidden/detached dynamic loop devices
> > > > and associated mounts, which was also an often-issued request. With the
> > > old mount api this can be achieved by creating a temporary loopfs and
> > > stashing a file descriptor to the mount point and the loop-control
> > > device and immediately unmounting the loopfs instance.  With the new
> > > mount api a detached mount can be created directly (i.e. a mount not
> > > visible anywhere in the filesystem). New loop devices can then be
> > > allocated and configured. They can be mounted through
> > > > /proc/self/fd/<nr> with the old mount api or by using the fd directly
> > > with the new mount api. Combined with a mount namespace this allows for
> > > fully auto-cleaned up loop devices on program crash. This ties back to
> > > various use-cases and is illustrated in [4].
> > > 
> > > The filesystem representation requires the standard boilerplate
> > > filesystem code we know from other tiny filesystems. And all of
> > > the loopfs code is hidden under a config option that defaults to false.
> > > > This specifically means that none of the code even exists when users do
> > > not have any use-case for loopfs.
> > > In addition, the loopfs code does not alter how loop devices behave at
> > > all, i.e. there are no changes to any existing workloads and I've taken
> > > care to ifdef all loopfs specific things out.
> > > 
> > > Each loopfs mount is a separate instance. As such loop devices created
> > > in one instance are independent of loop devices created in another
> > > instance. This specifically entails that loop devices are only visible
> > > in the loopfs instance they belong to.
> > > 
> > > > The number of loop devices available in loopfs instances is
> > > hierarchically limited through /proc/sys/user/max_loop_devices via the
> > > ucount infrastructure (Thanks to David Rheinsberg for pointing out that
> > > missing piece.). An administrator could e.g. set
> > > echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
> > > instance mounted by uid x can only create 3 loop devices no matter how
> > > many loopfs instances they mount. This limit applies hierarchically to
> > > all user namespaces.
> > 
> > Hm, info->device_count is per loopfs mount, though, right?  I don't
> > see where this gets incremented for all of a user's loopfs mounts
> > when one adds a loopdev?
> > 
> > I'm sure I'm missing something obvious...
> 
> Hm, I think you might be mixing up the two limits? device_count
> corresponds to the "max" mount option and is not involved in enforcing
> hierarchical limits. The global restriction is enforced through
> inc_ucount() which tracks by the uid of the mounter of the superblock.
> If the same user mounts multiple loopfs instances in the same namespace
> the ucount infra will enforce his quota across all loopfs instances.

Well I'm trying to understand what the point of the max mount option
is :)  I can just do N mounts to get N*max mounts to work around it?
But meanwhile if I have a daemon mounting isos over loopdevs to extract
some files (bc I never heard of bsdtar :), I risk more spurious failures
due to hitting max?

If you think we need it, that's fine - it just has the odor of something
more trouble than it's worth.

Anyway, with or without it,

Reviewed-by: Serge Hallyn <serge@hallyn.com>

thanks,
-serge
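
For readers following the two limits discussed above, here is a minimal
userspace sketch of how they interact. It is a model under stated
assumptions, not the kernel implementation: all model_* names are
hypothetical, and the per-(namespace, uid) tracking of the real ucount
code is collapsed to one counter per namespace. It keeps only the two
properties described in the thread: "max" is checked per loopfs
instance, while the ucount charge is shared by all instances and is
checked at every level up the namespace hierarchy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace with one loop-device counter. */
struct model_userns {
	struct model_userns *parent; /* NULL for the initial namespace */
	int max_loop_devices;        /* models /proc/sys/user/max_loop_devices */
	int count;                   /* devices currently charged here */
};

/* Hypothetical model of one loopfs superblock. */
struct model_loopfs {
	struct model_userns *ns; /* namespace charged for this instance */
	int max;                 /* models the "max" mount option */
	int device_count;        /* devices allocated in this instance */
};

/* Charge one device at every level; undo the partial charge on failure. */
static bool model_inc_ucount(struct model_userns *ns)
{
	struct model_userns *iter, *undo;

	for (iter = ns; iter; iter = iter->parent) {
		if (iter->count >= iter->max_loop_devices)
			goto fail;
		iter->count++;
	}
	return true;
fail:
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Allocate one loop device in an instance, honouring both limits. */
static bool model_loopfs_add(struct model_loopfs *fs)
{
	assert(fs != NULL && fs->ns != NULL);

	if (fs->device_count >= fs->max)
		return false; /* per-instance "max" reached */
	if (!model_inc_ucount(fs->ns))
		return false; /* hierarchical ucount limit reached */
	fs->device_count++;
	return true;
}
```

In this model, mounting N instances does let a user go past a single
instance's "max", but never past the ucount limit, which is the only
limit that holds across instances.
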

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/7] loopfs: implement loopfs
  2020-04-23 16:17       ` Serge E. Hallyn
@ 2020-04-23 16:29         ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2020-04-23 16:29 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block,
	linux-api, Jonathan Corbet, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Saravana Kannan, Jan Kara, David Howells,
	Seth Forshee, David Rheinsberg, Tom Gundersen, Christian Kellner,
	Dmitry Vyukov, Stéphane Graber, linux-doc, netdev,
	Steve Barber, Dylan Reid, Filipe Brandenburger, Kees Cook,
	Benjamin Elder, Akihiro Suda

On Thu, Apr 23, 2020 at 11:17:17AM -0500, Serge Hallyn wrote:
> On Thu, Apr 23, 2020 at 01:24:01PM +0200, Christian Brauner wrote:
> > On Wed, Apr 22, 2020 at 04:52:13PM -0500, Serge Hallyn wrote:
> > > On Wed, Apr 22, 2020 at 04:54:32PM +0200, Christian Brauner wrote:
> > > > This implements loopfs, a loop device filesystem. It takes inspiration
> > > > from the binderfs filesystem I implemented about two years ago and with
> > > > which we had overall good experiences so far. Parts of it are also
> > > > based on [3] but it's mostly a new, imho cleaner approach.
> > > > 
> > > > > Loopfs allows creating private loop device instances for applications
> > > > for various use-cases. It covers the use-case that was expressed on-list
> > > > and in-person to get programmatic access to private loop devices for
> > > > image building in sandboxes. An illustration for this is provided in
> > > > [4].
> > > > 
> > > > Also loopfs is intended to provide loop devices to privileged and
> > > > unprivileged containers which has been a frequent request from various
> > > > major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
> > > > providing a non-exhaustive list of issues and requests (cf. [5]) around
> > > > this feature mainly to illustrate that I'm not making the use-cases up.
> > > > Currently none of this can be done safely since handing a loop device
> > > > from the host into a container means that the container can see anything
> > > > that the host is doing with that loop device and what other containers
> > > > are doing with that device too. And (bind-)mounting devtmpfs inside of
> > > > containers is not secure at all so also not an option (though sometimes
> > > > done out of despair apparently).
> > > > 
> > > > The workloads people run in containers are supposed to be indiscernible
> > > > from workloads run on the host and the tools inside of the container are
> > > > supposed to not be required to be aware that they are running inside a
> > > > container apart from containerization tools themselves. This is
> > > > especially true when running older distros in containers that did exist
> > > > > before containers were as ubiquitous as they are today. With loopfs, users
> > > > > can call mount -o loop and in a correctly set up container things work
> > > > > the same way they would on the host. The filesystem representation
> > > > > allows us to do this in a very simple way. At container setup, a
> > > > > container manager can mount a private instance of loopfs somewhere, e.g.
> > > > > at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
> > > > > to /dev/loop-control, pre-allocate and symlink the number of standard
> > > > devices into their standard location and have a service file or rules in
> > > > place that symlink additionally allocated loop devices through losetup
> > > > into place as well.
> > > > With the new syscall interception logic this is also possible for
> > > > unprivileged containers. In these cases when a user calls mount -o loop
> > > > > <image> <mountpoint> it will be possible to completely set up the loop
> > > > device in the container. The final mount syscall is handled through
> > > > syscall interception which we already implemented and released in
> > > > earlier kernels (see [1] and [2]) and is actively used in production
> > > > workloads. The mount is often rewritten to a fuse binary to provide safe
> > > > access for unprivileged containers.
> > > > 
> > > > Loopfs also allows the creation of hidden/detached dynamic loop devices
> > > > > and associated mounts, which was also an often-issued request. With the
> > > > old mount api this can be achieved by creating a temporary loopfs and
> > > > stashing a file descriptor to the mount point and the loop-control
> > > > device and immediately unmounting the loopfs instance.  With the new
> > > > mount api a detached mount can be created directly (i.e. a mount not
> > > > visible anywhere in the filesystem). New loop devices can then be
> > > > allocated and configured. They can be mounted through
> > > > > /proc/self/fd/<nr> with the old mount api or by using the fd directly
> > > > with the new mount api. Combined with a mount namespace this allows for
> > > > fully auto-cleaned up loop devices on program crash. This ties back to
> > > > various use-cases and is illustrated in [4].
> > > > 
> > > > The filesystem representation requires the standard boilerplate
> > > > filesystem code we know from other tiny filesystems. And all of
> > > > the loopfs code is hidden under a config option that defaults to false.
> > > > > This specifically means that none of the code even exists when users do
> > > > not have any use-case for loopfs.
> > > > In addition, the loopfs code does not alter how loop devices behave at
> > > > all, i.e. there are no changes to any existing workloads and I've taken
> > > > care to ifdef all loopfs specific things out.
> > > > 
> > > > Each loopfs mount is a separate instance. As such loop devices created
> > > > in one instance are independent of loop devices created in another
> > > > instance. This specifically entails that loop devices are only visible
> > > > in the loopfs instance they belong to.
> > > > 
> > > > > The number of loop devices available in loopfs instances is
> > > > hierarchically limited through /proc/sys/user/max_loop_devices via the
> > > > ucount infrastructure (Thanks to David Rheinsberg for pointing out that
> > > > missing piece.). An administrator could e.g. set
> > > > echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
> > > > instance mounted by uid x can only create 3 loop devices no matter how
> > > > many loopfs instances they mount. This limit applies hierarchically to
> > > > all user namespaces.
> > > 
> > > Hm, info->device_count is per loopfs mount, though, right?  I don't
> > > see where this gets incremented for all of a user's loopfs mounts
> > > when one adds a loopdev?
> > > 
> > > I'm sure I'm missing something obvious...
> > 
> > Hm, I think you might be mixing up the two limits? device_count
> > corresponds to the "max" mount option and is not involved in enforcing
> > hierarchical limits. The global restriction is enforced through
> > inc_ucount() which tracks by the uid of the mounter of the superblock.
> > If the same user mounts multiple loopfs instances in the same namespace
> > the ucount infra will enforce his quota across all loopfs instances.
> 
> Well I'm trying to understand what the point of the max mount option
> is :)  I can just do N mounts to get N*max mounts to work around it?
> But meanwhile if I have a daemon mounting isos over loopdevs to extract
> some files (bc I never heard of bsdtar :), I risk more spurious failures
> due to hitting max?
> 
> If you think we need it, that's fine - it just has the odor of something
> more trouble than it's worth.

Maybe I'm making too much of this and you're right. My use-case was
sharing the same loopfs superblock with locked down parties. But I guess
that we can handle that case with the ucount infra too. And since I'm
the only one who thinks it might be useful I'll drop it from this
patchset. If we have a need for it we can re-add it later. Sound ok?

Christian

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-04-23 16:29 UTC | newest]

Thread overview: 20+ messages
2020-04-22 14:54 [PATCH v2 0/7] loopfs Christian Brauner
2020-04-22 14:54 ` [PATCH v2 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
2020-04-22 16:34   ` Greg Kroah-Hartman
2020-04-22 14:54 ` [PATCH v2 2/7] loopfs: implement loopfs Christian Brauner
2020-04-22 21:52   ` Serge E. Hallyn
2020-04-23 11:24     ` Christian Brauner
2020-04-23 16:17       ` Serge E. Hallyn
2020-04-23 16:29         ` Christian Brauner
2020-04-22 14:54 ` [PATCH v2 3/7] loop: use ns_capable for some loop operations Christian Brauner
2020-04-22 20:50   ` Serge E. Hallyn
2020-04-22 14:54 ` [PATCH v2 4/7] kernfs: handle multiple namespace tags Christian Brauner
2020-04-22 22:01   ` Serge E. Hallyn
2020-04-22 14:54 ` [PATCH v2 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
2020-04-23  1:17   ` Serge E. Hallyn
2020-04-23 11:15     ` Christian Brauner
2020-04-23 16:13       ` Serge E. Hallyn
2020-04-22 14:54 ` [PATCH v2 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
2020-04-23  1:36   ` Serge E. Hallyn
2020-04-22 14:54 ` [PATCH v2 7/7] loopfs: only show devices in their correct instance Christian Brauner
2020-04-23  1:37   ` Serge E. Hallyn
