* [PATCH v3 0/7] loopfs
@ 2020-04-24 16:20 Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Hey everyone,

This is v3 of loopfs.

The branch is available in three locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
https://gitlab.com/brauner/linux/-/commits/loopfs
https://github.com/brauner/linux/tree/loopfs

I've added a few more people to the Cc that want to make use of this and
I've added the missing ucount part that David pointed out and expanded a
little more on how this is used. I also removed the "max" mount option
which people argued wasn't very useful, bringing down the loc count
quite a bit actually.

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows applications to create private loop device instances
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs a
user can call mount -o loop and in a correctly set up container things
work the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somewhere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre-allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely set up the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts, which has also been a frequent request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].
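
The old mount api sequence described above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not code
from this series: the helper name loopfs_detached_open() is made up, the
filesystem type name "loop" is an assumption, and MNT_DETACH is used for
the "immediately unmounting" step because a plain umount would be
expected to fail with EBUSY while file descriptors are still open.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Hypothetical helper: mount a temporary loopfs instance on tmpdir,
 * stash fds for the mount point and its loop-control node, then lazily
 * unmount so the instance is no longer visible in the filesystem.
 *
 * Returns the loop-control fd and stores the mount point fd in *mnt_fd,
 * or returns -1 (leaving *mnt_fd untouched) on failure.
 */
static int loopfs_detached_open(const char *tmpdir, int *mnt_fd)
{
	int dfd, cfd, saved_errno;

	assert(tmpdir && mnt_fd);

	/* ASSUMPTION: "loop" as the filesystem type name. */
	if (mount("none", tmpdir, "loop", 0, NULL) < 0)
		return -1;

	dfd = open(tmpdir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	cfd = dfd < 0 ? -1 : openat(dfd, "loop-control", O_RDWR | O_CLOEXEC);
	saved_errno = errno;

	/*
	 * Detach the mount in both the success and the failure case. On
	 * success the open fds keep the loopfs instance alive.
	 */
	umount2(tmpdir, MNT_DETACH);

	if (cfd < 0) {
		if (dfd >= 0)
			close(dfd);
		errno = saved_errno;
		return -1;
	}

	*mnt_fd = dfd;
	return cfd;
}
```

A caller would then allocate and configure devices through the returned
loop-control fd and reach the device nodes relative to the stashed mount
point fd. When the process exits or crashes, both fds are closed and the
detached instance goes away with them.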

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances is
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.
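
To illustrate what "hierarchically" means here, the following is a
small userspace model of the accounting, not the kernel's ucount code.
The struct ns_model type and both helpers are hypothetical, and the
model keeps one counter per namespace level instead of one per
(namespace, uid) pair. Creating a device charges the namespace it is
created in and every ancestor; if any level is at its limit the charge
is rolled back and the creation fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one user namespace level. */
struct ns_model {
	struct ns_model *parent;  /* NULL for the initial namespace */
	int max_loop_devices;     /* models /proc/sys/user/max_loop_devices */
	int nr_loop_devices;      /* devices currently charged at this level */
};

/* Charge one loop device to ns and all its ancestors, or to none. */
static bool charge_loop_device(struct ns_model *ns)
{
	struct ns_model *iter, *undo;

	assert(ns != NULL);

	for (iter = ns; iter; iter = iter->parent) {
		if (iter->nr_loop_devices >= iter->max_loop_devices)
			goto fail;
		iter->nr_loop_devices++;
	}
	return true;

fail:
	/* Roll back the levels that were already charged. */
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->nr_loop_devices--;
	return false;
}

/* Release one previously charged loop device at every level. */
static void uncharge_loop_device(struct ns_model *ns)
{
	for (; ns; ns = ns->parent)
		ns->nr_loop_devices--;
}
```

With a limit of 3 in the parent and 10 in a child namespace, the child
can never hold more than 3 devices, and fewer if the parent already
holds some itself.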

Thanks!
Christian

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: https://github.com/kubernetes-sigs/kind/issues/1333
     https://github.com/kubernetes-sigs/kind/issues/1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     https://github.com/moby/moby/issues/27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813

Christian Brauner (7):
  kobject_uevent: remove unneeded netlink_ns check
  loopfs: implement loopfs
  loop: use ns_capable for some loop operations
  kernfs: handle multiple namespace tags
  loop: preserve sysfs backwards compatibility
  loopfs: start attaching correct namespace during loop_add()
  loopfs: only show devices in their correct instance

 Documentation/filesystems/sysfs-tagging.txt |   1 -
 MAINTAINERS                                 |   5 +
 block/genhd.c                               |  85 ++++
 block/partitions/core.c                     |   1 +
 drivers/base/devtmpfs.c                     |   4 +-
 drivers/block/Kconfig                       |   4 +
 drivers/block/Makefile                      |   1 +
 drivers/block/loop.c                        | 205 +++++++--
 drivers/block/loop.h                        |   7 +
 drivers/block/loopfs/Makefile               |   3 +
 drivers/block/loopfs/loopfs.c               | 439 ++++++++++++++++++++
 drivers/block/loopfs/loopfs.h               |  38 ++
 fs/kernfs/dir.c                             |  38 +-
 fs/kernfs/kernfs-internal.h                 |  33 +-
 fs/kernfs/mount.c                           |  11 +-
 fs/sysfs/mount.c                            |  14 +-
 include/linux/device.h                      |   3 +
 include/linux/genhd.h                       |   3 +
 include/linux/kernfs.h                      |  44 +-
 include/linux/kobject_ns.h                  |   7 +-
 include/linux/sysfs.h                       |   8 +-
 include/linux/user_namespace.h              |   3 +
 include/uapi/linux/magic.h                  |   1 +
 kernel/ucount.c                             |   3 +
 lib/kobject.c                               |  17 +-
 lib/kobject_uevent.c                        |   2 +-
 net/core/net-sysfs.c                        |   6 -
 27 files changed, 891 insertions(+), 95 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h


base-commit: ae83d0b416db002fe95601e7f97f64b59514d936
-- 
2.26.2



* [PATCH v3 1/7] kobject_uevent: remove unneeded netlink_ns check
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 2/7] loopfs: implement loopfs Christian Brauner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Back when I rewrote large chunks of uevent sending I should have removed
the .netlink_ns method completely after having removed its last user in
[1]. Let's remove it now and also remove the helper associated with it
that is unused too.

Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
Cc: "David S. Miller" <davem@davemloft.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 Documentation/filesystems/sysfs-tagging.txt |  1 -
 include/linux/kobject_ns.h                  |  3 ---
 lib/kobject.c                               | 13 -------------
 lib/kobject_uevent.c                        |  2 +-
 net/core/net-sysfs.c                        |  6 ------
 5 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
index c7c8e6438958..51d28dd8b84f 100644
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ b/Documentation/filesystems/sysfs-tagging.txt
@@ -37,6 +37,5 @@ Users of this interface:
 - define a type in the kobj_ns_type enumeration.
 - call kobj_ns_type_register() with its kobj_ns_type_operations which has
   - current_ns() which returns current's namespace
-  - netlink_ns() which returns a socket's namespace
   - initial_ns() which returns the initial namesapce
 - call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 069aa2ebef90..991a9286bcea 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -32,7 +32,6 @@ enum kobj_ns_type {
 /*
  * Callbacks so sysfs can determine namespaces
  *   @grab_current_ns: return a new reference to calling task's namespace
- *   @netlink_ns: return namespace to which a sock belongs (right?)
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
  */
@@ -40,7 +39,6 @@ struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
 	bool (*current_may_mount)(void);
 	void *(*grab_current_ns)(void);
-	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
 };
@@ -52,7 +50,6 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
 bool kobj_ns_current_may_mount(enum kobj_ns_type type);
 void *kobj_ns_grab_current(enum kobj_ns_type type);
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
 void kobj_ns_drop(enum kobj_ns_type type, void *ns);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 83198cb37d8d..6f07083cc111 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -1092,19 +1092,6 @@ void *kobj_ns_grab_current(enum kobj_ns_type type)
 }
 EXPORT_SYMBOL_GPL(kobj_ns_grab_current);
 
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk)
-{
-	const void *ns = NULL;
-
-	spin_lock(&kobj_ns_type_lock);
-	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
-	    kobj_ns_ops_tbl[type])
-		ns = kobj_ns_ops_tbl[type]->netlink_ns(sk);
-	spin_unlock(&kobj_ns_type_lock);
-
-	return ns;
-}
-
 const void *kobj_ns_initial(enum kobj_ns_type type)
 {
 	const void *ns = NULL;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 7998affa45d4..a45b3eeaa2b9 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -400,7 +400,7 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 	 * are the only tag relevant here since we want to decide which
 	 * network namespaces to broadcast the uevent into.
 	 */
-	if (ops && ops->netlink_ns && kobj->ktype->namespace)
+	if (ops && kobj->ktype->namespace)
 		if (ops->type == KOBJ_NS_TYPE_NET)
 			net = kobj->ktype->namespace(kobj);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4773ad6ec111..3fa35a3c843a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1694,16 +1694,10 @@ static const void *net_initial_ns(void)
 	return &init_net;
 }
 
-static const void *net_netlink_ns(struct sock *sk)
-{
-	return sock_net(sk);
-}
-
 const struct kobj_ns_type_operations net_ns_type_operations = {
 	.type = KOBJ_NS_TYPE_NET,
 	.current_may_mount = net_current_may_mount,
 	.grab_current_ns = net_grab_current_ns,
-	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
 	.drop_ns = net_drop_ns,
 };
-- 
2.26.2



* [PATCH v3 2/7] loopfs: implement loopfs
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 3/7] loop: use ns_capable for some loop operations Christian Brauner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.

Loopfs allows applications to create private loop device instances
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].

Also loopfs is intended to provide loop devices to privileged and
unprivileged containers which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).

The workloads people run in containers are supposed to be indiscernible
from workloads run on the host and the tools inside of the container are
supposed to not be required to be aware that they are running inside a
container apart from containerization tools themselves. This is
especially true when running older distros in containers that did exist
before containers were as ubiquitous as they are today. With loopfs a
user can call mount -o loop and in a correctly set up container things
work the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somewhere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre-allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely set up the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.

Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts, which has also been a frequent request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance.  With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/<fd>/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].

The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.

Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.

The number of loop devices available in loopfs instances is
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.

In addition, loopfs has a "max" mount option which allows to set a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.

Note that in __loop_clr_fd() we now need to check not just whether bdev
is valid but also whether bdev->bd_disk is valid. This wasn't necessary
before because in order to call LOOP_CLR_FD the loop device would need
to be open and thus bdev->bd_disk was guaranteed to be allocated. For
loopfs loop devices we allow callers to simply unlink them just as we do
for binderfs binder devices and we do also need to account for the case
where a loopfs superblock is shut down while backing files might still
be associated with some loop devices. In such cases no bd_disk device
will be attached to bdev. This is not in itself noteworthy; it's more
about documenting the "why" of the added bdev->bd_disk check for
posterity.

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: https://github.com/kubernetes-sigs/kind/issues/1333
     https://github.com/kubernetes-sigs/kind/issues/1248
     https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
     https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
     https://gitlab.com/gitlab-com/support-forum/issues/3732
     https://github.com/moby/moby/issues/27886
     https://twitter.com/_AkihiroSuda_/status/1249664478267854848
     https://serverfault.com/questions/701384/loop-device-in-a-linux-container
     https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
     https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steve Barber <smbarber@google.com>
Cc: Filipe Brandenburger <filbranden@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benjamin Elder <bentheelder@google.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dylan Reid <dgreid@google.com>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Akihiro Suda <suda.kyoto@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- David Rheinsberg <david.rheinsberg@gmail.com> /
  Christian Brauner <christian.brauner@ubuntu.com>:
  - Correctly cleanup loop devices that are in-use after the loopfs
    instance has been shut down. This is important for some use-cases
    that David pointed out where they effectively create a loopfs
    instance, allocate devices and drop unnecessary references to it.
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Replace lo_loopfs_i inode member in struct loop_device with a custom
    struct lo_info pointer which is only allocated for loopfs loop
    devices.

/* v3 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Fix loopfs_access() to not care about non-loopfs devices.
  - Stash refcounted sbinfo in lo_info to simplify retrieval of user
    namespace. This way each loopfs instance just takes a single
    reference to the user namespace that is dropped when the
    last loop device is removed. This puts us on the safe side. (Thanks
    to Serge for making me aware of this issue.)
- David Rheinsberg <david.rheinsberg@gmail.com> /
  Serge Hallyn <serge@hallyn.com>:
  - Remove "max" mount option.
---
 MAINTAINERS                    |   5 +
 drivers/block/Kconfig          |   4 +
 drivers/block/Makefile         |   1 +
 drivers/block/loop.c           | 182 +++++++++++---
 drivers/block/loop.h           |   7 +
 drivers/block/loopfs/Makefile  |   3 +
 drivers/block/loopfs/loopfs.c  | 431 +++++++++++++++++++++++++++++++++
 drivers/block/loopfs/loopfs.h  |  37 +++
 include/linux/user_namespace.h |   3 +
 include/uapi/linux/magic.h     |   1 +
 kernel/ucount.c                |   3 +
 11 files changed, 646 insertions(+), 31 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b816a453b10e..560b37a65bce 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9957,6 +9957,11 @@ W:	http://www.avagotech.com/support/
 F:	drivers/message/fusion/
 F:	drivers/scsi/mpt3sas/
 
+LOOPFS FILE SYSTEM
+M:	Christian Brauner <christian.brauner@ubuntu.com>
+S:	Supported
+F:	drivers/block/loopfs/
+
 LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers
 M:	Matthew Wilcox <willy@infradead.org>
 L:	linux-scsi@vger.kernel.org
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 025b1b77b11a..d7ff37d795ad 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -214,6 +214,10 @@ config BLK_DEV_LOOP
 
 	  Most users will answer N here.
 
+config BLK_DEV_LOOPFS
+	bool "Loopback device virtual filesystem support"
+	depends on BLK_DEV_LOOP=y
+
 config BLK_DEV_LOOP_MIN_COUNT
 	int "Number of loop devices to pre-create at init time"
 	depends on BLK_DEV_LOOP
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 795facd8cf19..7052be26aa8b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
 obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
 obj-$(CONFIG_BLK_DEV_RBD)     += rbd.o
 obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)	+= mtip32xx/
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs/
 
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_ZRAM) += zram/
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..0c99ee0b42a8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -81,6 +81,10 @@
 
 #include "loop.h"
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+#include "loopfs/loopfs.h"
+#endif
+
 #include <linux/uaccess.h>
 
 static DEFINE_IDR(loop_index_idr);
@@ -1115,6 +1119,24 @@ loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer,
 	return err;
 }
 
+static void loop_remove(struct loop_device *lo)
+{
+	del_gendisk(lo->lo_disk);
+	blk_cleanup_queue(lo->lo_queue);
+	blk_mq_free_tag_set(&lo->tag_set);
+	put_disk(lo->lo_disk);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	loopfs_remove(lo);
+#endif
+	kfree(lo);
+}
+
+static inline void __loop_remove(struct loop_device *lo)
+{
+	idr_remove(&loop_index_idr, lo->lo_number);
+	loop_remove(lo);
+}
+
 static int __loop_clr_fd(struct loop_device *lo, bool release)
 {
 	struct file *filp = NULL;
@@ -1164,7 +1186,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);
-	if (bdev) {
+	if (bdev && bdev->bd_disk) {
 		bd_set_size(bdev, 0);
 		/* let user-space know about this change */
 		kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
@@ -1174,7 +1196,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	module_put(THIS_MODULE);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
-	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
+	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev && bdev->bd_disk;
 	lo_number = lo->lo_number;
 	loop_unprepare_queue(lo);
 out_unlock:
@@ -1213,7 +1235,12 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 	lo->lo_flags = 0;
 	if (!part_shift)
 		lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
-	lo->lo_state = Lo_unbound;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (loopfs_wants_remove(lo))
+		__loop_remove(lo);
+	else
+#endif
+		lo->lo_state = Lo_unbound;
 	mutex_unlock(&loop_ctl_mutex);
 
 	/*
@@ -1259,6 +1286,74 @@ static int loop_clr_fd(struct loop_device *lo)
 	return __loop_clr_fd(lo, false);
 }
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+int loopfs_rundown_locked(struct loop_device *lo)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(!loopfs_device(lo)))
+		return -EINVAL;
+
+	ret = mutex_lock_killable(&loop_ctl_mutex);
+	if (ret)
+		return ret;
+
+	if (lo->lo_state != Lo_unbound || atomic_read(&lo->lo_refcnt) > 0) {
+		ret = -EBUSY;
+	} else {
+		/*
+		 * Since the device is unbound it has no associated backing
+		 * file and we can safely set Lo_rundown to prevent it from
+		 * being found. Actual cleanup happens during inode eviction.
+		 */
+		lo->lo_state = Lo_rundown;
+		ret = 0;
+	}
+
+	mutex_unlock(&loop_ctl_mutex);
+	return ret;
+}
+
+/**
+ * loopfs_evict_locked() - remove loop device or mark inactive
+ * @lo:	loopfs loop device
+ *
+ * This function will remove a loop device. If it has no users
+ * and is bound the backing file will be cleaned up. If the loop
+ * device has users it will be marked for auto cleanup.
+ * This function is only called when a loopfs instance is shutdown
+ * when all references to it from this loopfs instance have been
+ * dropped. If there are still any references to it cleanup will
+ * happen in lo_release().
+ */
+void loopfs_evict_locked(struct loop_device *lo)
+{
+	struct lo_loopfs *lo_info;
+	struct inode *lo_inode;
+
+	WARN_ON_ONCE(!loopfs_device(lo));
+
+	mutex_lock(&loop_ctl_mutex);
+	lo_info = lo->lo_info;
+	lo_inode = lo_info->lo_inode;
+	lo_info->lo_inode = NULL;
+	lo_info->lo_flags |= LOOPFS_FLAGS_INACTIVE;
+
+	if (atomic_read(&lo->lo_refcnt) > 0) {
+		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+	} else {
+		lo->lo_state = Lo_rundown;
+		lo->lo_disk->private_data = NULL;
+		lo_inode->i_private = NULL;
+
+		mutex_unlock(&loop_ctl_mutex);
+		__loop_clr_fd(lo, false);
+		return;
+	}
+	mutex_unlock(&loop_ctl_mutex);
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int
 loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 {
@@ -1842,7 +1937,7 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 
 	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
 		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
+			goto out_remove;
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&loop_ctl_mutex);
 		/*
@@ -1860,6 +1955,12 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		blk_mq_unfreeze_queue(lo->lo_queue);
 	}
 
+out_remove:
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (lo->lo_state != Lo_bound && loopfs_wants_remove(lo))
+		__loop_remove(lo);
+#endif
+
 out_unlock:
 	mutex_unlock(&loop_ctl_mutex);
 }
@@ -2006,7 +2107,7 @@ static const struct blk_mq_ops loop_mq_ops = {
 	.complete	= lo_complete_rq,
 };
 
-static int loop_add(struct loop_device **l, int i)
+static int loop_add(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	struct gendisk *disk;
@@ -2096,7 +2197,17 @@ static int loop_add(struct loop_device **l, int i)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+
 	add_disk(disk);
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	err = loopfs_add(lo, inode, disk_devt(disk));
+	if (err) {
+		__loop_remove(lo);
+		goto out;
+	}
+#endif
+
 	*l = lo;
 	return lo->lo_number;
 
@@ -2112,36 +2223,41 @@ static int loop_add(struct loop_device **l, int i)
 	return err;
 }
 
-static void loop_remove(struct loop_device *lo)
-{
-	del_gendisk(lo->lo_disk);
-	blk_cleanup_queue(lo->lo_queue);
-	blk_mq_free_tag_set(&lo->tag_set);
-	put_disk(lo->lo_disk);
-	kfree(lo);
-}
+struct find_free_cb_data {
+	struct loop_device **l;
+	struct inode *inode;
+};
 
 static int find_free_cb(int id, void *ptr, void *data)
 {
 	struct loop_device *lo = ptr;
-	struct loop_device **l = data;
+	struct find_free_cb_data *cb_data = data;
 
-	if (lo->lo_state == Lo_unbound) {
-		*l = lo;
-		return 1;
-	}
-	return 0;
+	if (lo->lo_state != Lo_unbound)
+		return 0;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	if (!loopfs_access(cb_data->inode, lo))
+		return 0;
+#endif
+
+	*cb_data->l = lo;
+	return 1;
 }
 
-static int loop_lookup(struct loop_device **l, int i)
+static int loop_lookup(struct loop_device **l, int i, struct inode *inode)
 {
 	struct loop_device *lo;
 	int ret = -ENODEV;
 
 	if (i < 0) {
 		int err;
+		struct find_free_cb_data cb_data = {
+			.l = &lo,
+			.inode = inode,
+		};
 
-		err = idr_for_each(&loop_index_idr, &find_free_cb, &lo);
+		err = idr_for_each(&loop_index_idr, &find_free_cb, &cb_data);
 		if (err == 1) {
 			*l = lo;
 			ret = lo->lo_number;
@@ -2152,6 +2268,11 @@ static int loop_lookup(struct loop_device **l, int i)
 	/* lookup and return a specific i */
 	lo = idr_find(&loop_index_idr, i);
 	if (lo) {
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		if (!loopfs_access(inode, lo))
+			return -EACCES;
+#endif
+
 		*l = lo;
 		ret = lo->lo_number;
 	}
@@ -2166,9 +2287,9 @@ static struct kobject *loop_probe(dev_t dev, int *part, void *data)
 	int err;
 
 	mutex_lock(&loop_ctl_mutex);
-	err = loop_lookup(&lo, MINOR(dev) >> part_shift);
+	err = loop_lookup(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
-		err = loop_add(&lo, MINOR(dev) >> part_shift);
+		err = loop_add(&lo, MINOR(dev) >> part_shift, NULL);
 	if (err < 0)
 		kobj = NULL;
 	else
@@ -2192,15 +2313,15 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 	ret = -ENOSYS;
 	switch (cmd) {
 	case LOOP_CTL_ADD:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret >= 0) {
 			ret = -EEXIST;
 			break;
 		}
-		ret = loop_add(&lo, parm);
+		ret = loop_add(&lo, parm, file_inode(file));
 		break;
 	case LOOP_CTL_REMOVE:
-		ret = loop_lookup(&lo, parm);
+		ret = loop_lookup(&lo, parm, file_inode(file));
 		if (ret < 0)
 			break;
 		if (lo->lo_state != Lo_unbound) {
@@ -2212,14 +2333,13 @@ static long loop_control_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 		lo->lo_disk->private_data = NULL;
-		idr_remove(&loop_index_idr, lo->lo_number);
-		loop_remove(lo);
+		__loop_remove(lo);
 		break;
 	case LOOP_CTL_GET_FREE:
-		ret = loop_lookup(&lo, -1);
+		ret = loop_lookup(&lo, -1, file_inode(file));
 		if (ret >= 0)
 			break;
-		ret = loop_add(&lo, -1);
+		ret = loop_add(&lo, -1, file_inode(file));
 	}
 	mutex_unlock(&loop_ctl_mutex);
 
@@ -2307,7 +2427,7 @@ static int __init loop_init(void)
 	/* pre-create number of devices given by config or max_loop */
 	mutex_lock(&loop_ctl_mutex);
 	for (i = 0; i < nr; i++)
-		loop_add(&lo, i);
+		loop_add(&lo, i, NULL);
 	mutex_unlock(&loop_ctl_mutex);
 
 	printk(KERN_INFO "loop: module loaded\n");
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index af75a5ee4094..2b3cd5bac71e 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -17,6 +17,10 @@
 #include <linux/kthread.h>
 #include <uapi/linux/loop.h>
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+#include "loopfs/loopfs.h"
+#endif
+
 /* Possible states of device */
 enum {
 	Lo_unbound,
@@ -62,6 +66,9 @@ struct loop_device {
 	struct request_queue	*lo_queue;
 	struct blk_mq_tag_set	tag_set;
 	struct gendisk		*lo_disk;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct lo_loopfs	*lo_info;
+#endif
 };
 
 struct loop_cmd {
diff --git a/drivers/block/loopfs/Makefile b/drivers/block/loopfs/Makefile
new file mode 100644
index 000000000000..87ec703b662e
--- /dev/null
+++ b/drivers/block/loopfs/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+loopfs-y			:= loopfs.o
+obj-$(CONFIG_BLK_DEV_LOOPFS)	+= loopfs.o
diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
new file mode 100644
index 000000000000..09cd5a919ea2
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.c
@@ -0,0 +1,431 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/fsnotify.h>
+#include <linux/genhd.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/magic.h>
+#include <linux/major.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include "../loop.h"
+#include "loopfs.h"
+
+#define FIRST_INODE 1
+#define SECOND_INODE 2
+#define INODE_OFFSET 3
+
+struct loopfs_info {
+	kuid_t root_uid;
+	kgid_t root_gid;
+	struct dentry *control_dentry;
+	struct user_namespace *user_ns;
+	atomic_t users;
+};
+
+static inline struct loopfs_info *LOOPFS_SB(const struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+struct super_block *loopfs_i_sb(const struct inode *inode)
+{
+	if (inode && inode->i_sb->s_magic == LOOPFS_SUPER_MAGIC)
+		return inode->i_sb;
+
+	return NULL;
+}
+
+bool loopfs_device(const struct loop_device *lo)
+{
+	return lo->lo_info != NULL;
+}
+
+struct user_namespace *loopfs_ns(const struct loop_device *lo)
+{
+	if (loopfs_device(lo))
+		return lo->lo_info->sbi->user_ns;
+	return &init_user_ns;
+}
+
+bool loopfs_access(const struct inode *first, struct loop_device *lo)
+{
+	struct inode *second = NULL;
+
+	if (loopfs_device(lo)) {
+		second = lo->lo_info->lo_inode;
+		if (!second)
+			return false; /* loopfs already gone */
+	}
+	return loopfs_i_sb(first) == loopfs_i_sb(second);
+}
+
+bool loopfs_wants_remove(const struct loop_device *lo)
+{
+	return loopfs_device(lo) &&
+	       (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE);
+}
+
+/**
+ * loopfs_add - allocate inode from super block of a loopfs mount
+ * @lo:		loop device for which we are creating a new device entry
+ * @ref_inode:	inode from which the super block will be taken
+ * @device_nr:  device number of the associated disk device
+ *
+ * This function creates a new device node for @lo in the loopfs mount
+ * that @ref_inode belongs to. Minor numbers are limited and tracked
+ * globally. If @ref_inode does not belong to a loopfs mount the
+ * function does nothing and returns 0.
+ * Otherwise it will allocate a new inode from the super block of the
+ * filesystem mount, stash a struct loop_device in its i_private field
+ * and attach a dentry to that inode.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int loopfs_add(struct loop_device *lo, struct inode *ref_inode, dev_t device_nr)
+{
+	int ret;
+	char name[DISK_NAME_LEN];
+	struct super_block *sb;
+	struct loopfs_info *info;
+	struct dentry *root, *dentry;
+	struct inode *inode;
+	struct lo_loopfs *lo_info;
+
+	sb = loopfs_i_sb(ref_inode);
+	if (!sb)
+		return 0;
+
+	if (MAJOR(device_nr) != LOOP_MAJOR)
+		return -EINVAL;
+
+	lo_info = kzalloc(sizeof(struct lo_loopfs), GFP_KERNEL);
+	if (!lo_info) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	info = LOOPFS_SB(sb);
+	lo_info->lo_ucount = inc_ucount(sb->s_user_ns,
+					info->root_uid, UCOUNT_LOOP_DEVICES);
+	if (!lo_info->lo_ucount) {
+		ret = -ENOSPC;
+		goto err;
+	}
+
+	if (snprintf(name, sizeof(name), "loop%d", lo->lo_number) >= sizeof(name)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	inode = new_inode(sb);
+	if (!inode) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop device in this loopfs instance is opened.
+	 */
+	inode->i_ino = MINOR(device_nr) + INODE_OFFSET;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+	init_special_inode(inode, S_IFBLK | 0600, device_nr);
+
+	root = sb->s_root;
+	inode_lock(d_inode(root));
+	/* look it up */
+	dentry = lookup_one_len(name, root, strlen(name));
+	if (IS_ERR(dentry)) {
+		inode_unlock(d_inode(root));
+		iput(inode);
+		ret = PTR_ERR(dentry);
+		goto err;
+	}
+
+	if (d_really_is_positive(dentry)) {
+		/* already exists */
+		dput(dentry);
+		inode_unlock(d_inode(root));
+		iput(inode);
+		ret = -EEXIST;
+		goto err;
+	}
+
+	d_instantiate(dentry, inode);
+	fsnotify_create(d_inode(root), dentry);
+	inode_unlock(d_inode(root));
+
+	lo_info->lo_inode = inode;
+	lo->lo_info = lo_info;
+	atomic_inc(&info->users);
+	lo->lo_info->sbi = info;
+	inode->i_private = lo;
+
+	return 0;
+
+err:
+	if (lo_info && lo_info->lo_ucount)
+		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
+	kfree(lo_info);
+	return ret;
+}
+
+void loopfs_remove(struct loop_device *lo)
+{
+	struct lo_loopfs *lo_info = lo->lo_info;
+	struct loopfs_info *sbi;
+	struct inode *inode;
+	struct super_block *sb;
+	struct dentry *root, *dentry;
+
+	if (!lo_info)
+		return;
+
+	inode = lo_info->lo_inode;
+	if (!inode || !S_ISBLK(inode->i_mode) || imajor(inode) != LOOP_MAJOR)
+		goto out;
+
+	sb = loopfs_i_sb(inode);
+	lo_info->lo_inode = NULL;
+
+	/*
+	 * The root dentry is always the parent dentry since we don't allow
+	 * creation of directories.
+	 */
+	root = sb->s_root;
+
+	inode_lock(d_inode(root));
+	dentry = d_find_any_alias(inode);
+	if (dentry && simple_positive(dentry)) {
+		simple_unlink(d_inode(root), dentry);
+		d_delete(dentry);
+	}
+	dput(dentry);
+	inode_unlock(d_inode(root));
+
+out:
+	if (lo_info->lo_ucount)
+		dec_ucount(lo_info->lo_ucount, UCOUNT_LOOP_DEVICES);
+	sbi = lo_info->sbi;
+	if (atomic_dec_and_test(&sbi->users)) {
+		put_user_ns(sbi->user_ns);
+		kfree(sbi);
+	}
+	kfree(lo->lo_info);
+	lo->lo_info = NULL;
+}
+
+/**
+ * loopfs_loop_ctl_create - create a new loop-control device
+ * @sb: super block of the loopfs mount
+ *
+ * This function creates a new loop-control device node in the loopfs mount
+ * referred to by @sb.
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+static int loopfs_loop_ctl_create(struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct inode *inode = NULL;
+	struct dentry *root = sb->s_root;
+	struct loopfs_info *info = sb->s_fs_info;
+
+	if (info->control_dentry)
+		return 0;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = SECOND_INODE;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	init_special_inode(inode, S_IFCHR | 0600,
+			   MKDEV(MISC_MAJOR, LOOP_CTRL_MINOR));
+	/*
+	 * The i_fop field will be set to the correct fops by the device layer
+	 * when the loop-control device in this loopfs instance is opened.
+	 */
+	inode->i_uid = info->root_uid;
+	inode->i_gid = info->root_gid;
+
+	dentry = d_alloc_name(root, "loop-control");
+	if (!dentry) {
+		iput(inode);
+		return -ENOMEM;
+	}
+
+	info->control_dentry = dentry;
+	d_add(dentry, inode);
+
+	return 0;
+}
+
+static inline bool is_loopfs_control_device(const struct dentry *dentry)
+{
+	return LOOPFS_SB(dentry->d_sb)->control_dentry == dentry;
+}
+
+static int loopfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+			 struct inode *new_dir, struct dentry *new_dentry,
+			 unsigned int flags)
+{
+	if (is_loopfs_control_device(old_dentry) ||
+	    is_loopfs_control_device(new_dentry))
+		return -EPERM;
+
+	return simple_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
+}
+
+static int loopfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int ret;
+	struct loop_device *lo;
+
+	if (is_loopfs_control_device(dentry))
+		return -EPERM;
+
+	lo = d_inode(dentry)->i_private;
+	ret = loopfs_rundown_locked(lo);
+	if (ret)
+		return ret;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations loopfs_dir_inode_operations = {
+	.lookup = simple_lookup,
+	.rename = loopfs_rename,
+	.unlink = loopfs_unlink,
+};
+
+static void loopfs_evict_inode(struct inode *inode)
+{
+	struct loop_device *lo = inode->i_private;
+
+	clear_inode(inode);
+
+	if (lo && S_ISBLK(inode->i_mode) && imajor(inode) == LOOP_MAJOR) {
+		loopfs_evict_locked(lo);
+		inode->i_private = NULL;
+	}
+}
+
+static const struct super_operations loopfs_super_ops = {
+	.evict_inode    = loopfs_evict_inode,
+	.statfs         = simple_statfs,
+};
+
+static int loopfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct loopfs_info *info;
+	struct inode *inode = NULL;
+
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+
+	sb->s_iflags &= ~SB_I_NODEV;
+	sb->s_iflags |= SB_I_NOEXEC;
+	sb->s_magic = LOOPFS_SUPER_MAGIC;
+	sb->s_op = &loopfs_super_ops;
+	sb->s_time_gran = 1;
+
+	sb->s_fs_info = kzalloc(sizeof(struct loopfs_info), GFP_KERNEL);
+	if (!sb->s_fs_info)
+		return -ENOMEM;
+	info = sb->s_fs_info;
+
+	info->root_gid = make_kgid(sb->s_user_ns, 0);
+	if (!gid_valid(info->root_gid))
+		info->root_gid = GLOBAL_ROOT_GID;
+	info->root_uid = make_kuid(sb->s_user_ns, 0);
+	if (!uid_valid(info->root_uid))
+		info->root_uid = GLOBAL_ROOT_UID;
+	info->user_ns = get_user_ns(sb->s_user_ns);
+	atomic_set(&info->users, 1);
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = FIRST_INODE;
+	inode->i_fop = &simple_dir_operations;
+	inode->i_mode = S_IFDIR | 0755;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	inode->i_op = &loopfs_dir_inode_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return loopfs_loop_ctl_create(sb);
+}
+
+static int loopfs_fs_context_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, loopfs_fill_super);
+}
+
+static void loopfs_fs_context_free(struct fs_context *fc)
+{
+	struct loopfs_info *sbi = fc->s_fs_info;
+
+	fc->s_fs_info = NULL;
+	if (sbi && atomic_dec_and_test(&sbi->users)) {
+		put_user_ns(sbi->user_ns);
+		kfree(sbi);
+	}
+}
+
+static const struct fs_context_operations loopfs_fs_context_ops = {
+	.free		= loopfs_fs_context_free,
+	.get_tree	= loopfs_fs_context_get_tree,
+};
+
+static int loopfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &loopfs_fs_context_ops;
+	return 0;
+}
+
+static void loopfs_kill_sb(struct super_block *sb)
+{
+	struct loopfs_info *sbi = sb->s_fs_info;
+
+	sb->s_fs_info = NULL;
+	if (sbi && atomic_dec_and_test(&sbi->users)) {
+		put_user_ns(sbi->user_ns);
+		kfree(sbi);
+	}
+
+	kill_litter_super(sb);
+}
+
+static struct file_system_type loop_fs_type = {
+	.name			= "loop",
+	.init_fs_context	= loopfs_init_fs_context,
+	.kill_sb		= loopfs_kill_sb,
+	.fs_flags		= FS_USERNS_MOUNT,
+};
+
+int __init init_loopfs(void)
+{
+	init_user_ns.ucount_max[UCOUNT_LOOP_DEVICES] = 255;
+	return register_filesystem(&loop_fs_type);
+}
+
+module_init(init_loopfs);
+MODULE_AUTHOR("Christian Brauner <christian.brauner@ubuntu.com>");
+MODULE_DESCRIPTION("Loop device filesystem");
diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
new file mode 100644
index 000000000000..225d844f5a01
--- /dev/null
+++ b/drivers/block/loopfs/loopfs.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_LOOPFS_FS_H
+#define _LINUX_LOOPFS_FS_H
+
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/user_namespace.h>
+
+struct loop_device;
+
+#ifdef CONFIG_BLK_DEV_LOOPFS
+
+#define LOOPFS_FLAGS_INACTIVE (1 << 0)
+
+struct lo_loopfs {
+	struct loopfs_info *sbi;
+	struct ucounts *lo_ucount;
+	struct inode *lo_inode;
+	int lo_flags;
+};
+
+extern struct super_block *loopfs_i_sb(const struct inode *inode);
+extern bool loopfs_device(const struct loop_device *lo);
+extern struct user_namespace *loopfs_ns(const struct loop_device *lo);
+extern bool loopfs_access(const struct inode *first, struct loop_device *lo);
+extern int loopfs_add(struct loop_device *lo, struct inode *ref_inode,
+		      dev_t device_nr);
+extern void loopfs_remove(struct loop_device *lo);
+extern bool loopfs_wants_remove(const struct loop_device *lo);
+extern void loopfs_evict_locked(struct loop_device *lo);
+extern int loopfs_rundown_locked(struct loop_device *lo);
+
+#endif
+
+#endif /* _LINUX_LOOPFS_FS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..04a4891765c0 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -49,6 +49,9 @@ enum ucount_type {
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_INOTIFY_INSTANCES,
 	UCOUNT_INOTIFY_WATCHES,
+#endif
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	UCOUNT_LOOP_DEVICES,
 #endif
 	UCOUNT_COUNTS,
 };
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..0817d093a012 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -75,6 +75,7 @@
 #define BINFMTFS_MAGIC          0x42494e4d
 #define DEVPTS_SUPER_MAGIC	0x1cd1
 #define BINDERFS_SUPER_MAGIC	0x6c6f6f70
+#define LOOPFS_SUPER_MAGIC	0x6c6f6f71
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define PIPEFS_MAGIC            0x50495045
 #define PROC_SUPER_MAGIC	0x9fa0
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 11b1596e2542..fb0f6394a8bb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -73,6 +73,9 @@ static struct ctl_table user_table[] = {
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_ENTRY("max_inotify_instances"),
 	UCOUNT_ENTRY("max_inotify_watches"),
+#endif
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	UCOUNT_ENTRY("max_loop_devices"),
 #endif
 	{ }
 };
-- 
2.26.2



* [PATCH v3 3/7] loop: use ns_capable for some loop operations
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 2/7] loopfs: implement loopfs Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 4/7] kernfs: handle multiple namespace tags Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

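The LOOP_GET_STATUS, LOOP_SET_STATUS, and LOOP_SET_BLOCK_SIZE
operations are now allowed in non-initial namespaces. Their
CAP_SYS_ADMIN check is done with ns_capable() against the user
namespace of the loopfs instance the device belongs to. Devices that
are not part of a loopfs instance are still checked against the initial
user namespace. Most other operations were already possible before.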
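
To illustrate the intended effect of the check, here is a minimal
userspace sketch. It is not the kernel implementation: all model_*
names are hypothetical stand-ins, and the model of ns_capable() is
deliberately simplified (it only considers whether the target namespace
is the caller's namespace or one of its descendants).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical stand-in for the caller's credentials. */
struct model_cred {
	const struct model_user_ns *ns;	/* namespace the capability is held in */
	bool cap_sys_admin;
};

/* Hypothetical stand-in for struct loop_device. */
struct model_loop_device {
	const struct model_user_ns *loopfs_ns;	/* NULL if not a loopfs device */
};

const struct model_user_ns model_init_user_ns = { .parent = NULL };

/*
 * Simplified ns_capable(): a capability held in cred->ns applies to @target
 * if @target is cred->ns itself or one of its descendants. The real helper
 * has additional rules (e.g. for the namespace owner) that are left out.
 */
bool model_ns_capable(const struct model_cred *cred,
		      const struct model_user_ns *target)
{
	const struct model_user_ns *ns;

	assert(cred && target);
	if (!cred->cap_sys_admin)
		return false;
	for (ns = target; ns; ns = ns->parent)
		if (ns == cred->ns)
			return true;
	return false;
}

/* Mirrors loopfs_ns(): non-loopfs devices map to the initial namespace. */
const struct model_user_ns *model_loopfs_ns(const struct model_loop_device *lo)
{
	return lo->loopfs_ns ? lo->loopfs_ns : &model_init_user_ns;
}

/* Mirrors loop_capable() with CONFIG_BLK_DEV_LOOPFS enabled. */
bool model_loop_capable(const struct model_cred *cred,
			const struct model_loop_device *lo)
{
	return model_ns_capable(cred, model_loopfs_ns(lo));
}
```

In this model a caller that holds CAP_SYS_ADMIN only in a child user
namespace passes the check for a loopfs device owned by that namespace
but not for a device outside of loopfs, while a caller with the
capability in the initial namespace passes both.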

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Kellner <ckellner@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Rheinsberg <david.rheinsberg@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Adapted loop_capable() based on changes in the loopfs
    implementation patchset. Otherwise it is functionally equivalent to
    the v1 version.

/* v3 */
unchanged
---
 drivers/block/loop.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0c99ee0b42a8..40705f5aeabd 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1352,6 +1352,16 @@ void loopfs_evict_locked(struct loop_device *lo)
 	}
 	mutex_unlock(&loop_ctl_mutex);
 }
+
+static bool loop_capable(const struct loop_device *lo, int cap)
+{
+	return ns_capable(loopfs_ns(lo), cap);
+}
+#else /* !CONFIG_BLK_DEV_LOOPFS */
+static inline bool loop_capable(const struct loop_device *lo, int cap)
+{
+	return capable(cap);
+}
 #endif /* CONFIG_BLK_DEV_LOOPFS */
 
 static int
@@ -1368,7 +1378,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return err;
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
-	    !capable(CAP_SYS_ADMIN)) {
+	    !loop_capable(lo, CAP_SYS_ADMIN)) {
 		err = -EPERM;
 		goto out_unlock;
 	}
@@ -1499,7 +1509,7 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
 	info->lo_encrypt_type =
 		lo->lo_encryption ? lo->lo_encryption->number : 0;
-	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+	if (lo->lo_encrypt_key_size && loop_capable(lo, CAP_SYS_ADMIN)) {
 		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
 		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
 		       lo->lo_encrypt_key_size);
@@ -1723,7 +1733,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_clr_fd(lo);
 	case LOOP_SET_STATUS:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status_old(lo,
 					(struct loop_info __user *)arg);
 		}
@@ -1732,7 +1742,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		return loop_get_status_old(lo, (struct loop_info __user *) arg);
 	case LOOP_SET_STATUS64:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN)) {
+		if ((mode & FMODE_WRITE) || loop_capable(lo, CAP_SYS_ADMIN)) {
 			err = loop_set_status64(lo,
 					(struct loop_info64 __user *) arg);
 		}
@@ -1742,7 +1752,7 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 	case LOOP_SET_CAPACITY:
 	case LOOP_SET_DIRECT_IO:
 	case LOOP_SET_BLOCK_SIZE:
-		if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
+		if (!(mode & FMODE_WRITE) && !loop_capable(lo, CAP_SYS_ADMIN))
 			return -EPERM;
 		/* Fall through */
 	default:
-- 
2.26.2



* [PATCH v3 4/7] kernfs: handle multiple namespace tags
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
                   ` (2 preceding siblings ...)
  2020-04-24 16:20 ` [PATCH v3 3/7] loop: use ns_capable for some loop operations Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Since [1] kernfs supports namespace tags. This feature is essential to
enable sysfs to present different views of various parts depending on
the namespace tag. For example, the /sys/class/net/ directory will only
show network devices that belong to the network namespace that sysfs was
mounted in. This is achieved by stashing a reference to the network
namespace of the task mounting sysfs in the super block. And when a
lookup operation is performed on e.g. /sys/class/net/ kernfs will
compare the network namespace tag of the kernfs_node associated with the
device and kobject of the network device to the network namespace tag
stashed in the super block. This ensures that only network devices owned
by the network namespace sysfs was mounted in are shown, a feature which
is essential to containers.
For loopfs to show correct permissions in sysfs just as with network
devices we need to be able to tag kernfs_super_info with additional
namespaces. This extension was already anticipated in a comment to
struct kernfs_super_info:
  /*
   * Each sb is associated with one namespace tag, currently the
   * network namespace of the task which mounted this kernfs
   * instance.  If multiple tags become necessary, make the following
   * an array and compare kernfs_node tag against every entry.
   */
This patch extends the kernfs_super_info and kernfs_fs_context ns
pointers to fixed-size arrays of namespace tags. The size is taken from
the namespaces currently supported by kobjects, i.e. we don't extend it
to cover all namespaces but only the ones kernfs needs to support.
In addition, the kernfs_node struct gains an additional member that
indicates the type of namespace this kernfs_node was tagged with. This
allows us to retrieve the correct namespace tag from the
kernfs_fs_context and kernfs_super_info ns array with a simple indexing
operation. This has the advantage that we can just keep passing down the
correct namespace instead of passing down the array.
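
The indexing scheme can be sketched in plain userspace C roughly as
follows. This is only an illustration under assumptions: the model_*
types and the MODEL_NS_TYPE_USER entry are hypothetical and merely
stand in for kernfs_super_info, kernfs_node and enum kobj_ns_type, and
the visibility rule is reduced to the tag comparison itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace types; enum kobj_ns_type plays this role. */
enum model_ns_type {
	MODEL_NS_TYPE_NONE = 0,
	MODEL_NS_TYPE_NET,
	MODEL_NS_TYPE_USER,	/* assumed second tag type, for illustration */
	MODEL_NS_TYPES
};

/* Hypothetical stand-in for kernfs_super_info: one tag slot per type. */
struct model_super_info {
	const void *ns[MODEL_NS_TYPES];
};

/* Hypothetical stand-in for kernfs_node. */
struct model_node {
	const void *ns;			/* namespace tag */
	enum model_ns_type ns_type;	/* type of namespace tag */
	bool ns_enabled;		/* namespaces enabled for this node */
};

/*
 * A node without namespace support is always visible. Otherwise its tag
 * is compared against the super block slot selected by the node's type,
 * so only one array entry is ever looked at.
 */
bool model_node_visible(const struct model_super_info *info,
			const struct model_node *kn)
{
	if (!kn->ns_enabled)
		return true;
	assert(kn->ns_type > MODEL_NS_TYPE_NONE && kn->ns_type < MODEL_NS_TYPES);
	return info->ns[kn->ns_type] == kn->ns;
}
```

Because the node carries its own type, callers can keep passing a
single tag pointer around and only the final comparison needs to know
about the array.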

[1]: 608b4b9548de ("netns: Teach network device kobjects which namespace they are in.")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/kernfs/dir.c             |  6 +++---
 fs/kernfs/kernfs-internal.h |  9 ++++-----
 fs/kernfs/mount.c           | 11 +++++++----
 fs/sysfs/mount.c            | 10 +++++-----
 include/linux/kernfs.h      | 22 ++++++++++++++--------
 include/linux/sysfs.h       |  8 +++++---
 lib/kobject.c               |  2 +-
 7 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..1f2d894ae454 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -576,7 +576,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 
 	/* The kernfs node has been moved to a different namespace */
 	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns != kn->ns)
+	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
 		goto out_bad;
 
 	mutex_unlock(&kernfs_mutex);
@@ -1087,7 +1087,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dir->i_sb)->ns;
+		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
 
@@ -1673,7 +1673,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	mutex_lock(&kernfs_mutex);
 
 	if (kernfs_ns_enabled(parent))
-		ns = kernfs_info(dentry->d_sb)->ns;
+		ns = kernfs_info(dentry->d_sb)->ns[parent->ns_type];
 
 	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
 	     pos;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..7c972c00f84a 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -16,6 +16,7 @@
 #include <linux/xattr.h>
 
 #include <linux/kernfs.h>
+#include <linux/kobject_ns.h>
 #include <linux/fs_context.h>
 
 struct kernfs_iattrs {
@@ -62,12 +63,10 @@ struct kernfs_super_info {
 	struct kernfs_root	*root;
 
 	/*
-	 * Each sb is associated with one namespace tag, currently the
-	 * network namespace of the task which mounted this kernfs
-	 * instance.  If multiple tags become necessary, make the following
-	 * an array and compare kernfs_node tag against every entry.
+	 * Each sb can be associated with namespace tags. They will be used
+	 * to compare kernfs_node tags against relevant entries.
 	 */
-	const void		*ns;
+	const void		*ns[KOBJ_NS_TYPES];
 
 	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
 	struct list_head	node;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10..dc4ee0f0a597 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -279,14 +279,15 @@ static int kernfs_test_super(struct super_block *sb, struct fs_context *fc)
 	struct kernfs_super_info *sb_info = kernfs_info(sb);
 	struct kernfs_super_info *info = fc->s_fs_info;
 
-	return sb_info->root == info->root && sb_info->ns == info->ns;
+	return sb_info->root == info->root &&
+	       memcmp(sb_info->ns, info->ns, sizeof(sb_info->ns)) == 0;
 }
 
 static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	kfc->ns_tag = NULL;
+	memset(kfc->ns_tag, 0, sizeof(kfc->ns_tag));
 	return set_anon_super_fc(sb, fc);
 }
 
@@ -296,7 +297,7 @@ static int kernfs_set_super(struct super_block *sb, struct fs_context *fc)
  *
  * Return the namespace tag associated with kernfs super_block @sb.
  */
-const void *kernfs_super_ns(struct super_block *sb)
+const void **kernfs_super_ns(struct super_block *sb)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
 
@@ -324,7 +325,9 @@ int kernfs_get_tree(struct fs_context *fc)
 		return -ENOMEM;
 
 	info->root = kfc->root;
-	info->ns = kfc->ns_tag;
+	BUILD_BUG_ON(sizeof(info->ns) != sizeof(kfc->ns_tag));
+	memcpy(info->ns, kfc->ns_tag, sizeof(info->ns));
+
 	INIT_LIST_HEAD(&info->node);
 
 	fc->s_fs_info = info;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index db81cfbab9d6..711844a97b4c 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -41,8 +41,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 {
 	struct kernfs_fs_context *kfc = fc->fs_private;
 
-	if (kfc->ns_tag)
-		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
+		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -66,7 +66,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 	if (!kfc)
 		return -ENOMEM;
 
-	kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -81,10 +81,10 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 
 static void sysfs_kill_sb(struct super_block *sb)
 {
-	void *ns = (void *)kernfs_super_ns(sb);
+	void **ns = (void **)kernfs_super_ns(sb);
 
+	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
 	kernfs_kill_sb(sb);
-	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
 }
 
 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 89f6a4214a70..d0544f2e0c99 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -16,6 +16,7 @@
 #include <linux/atomic.h>
 #include <linux/uidgid.h>
 #include <linux/wait.h>
+#include <linux/kobject_ns.h>
 
 struct file;
 struct dentry;
@@ -137,8 +138,9 @@ struct kernfs_node {
 
 	struct rb_node		rb;
 
-	const void		*ns;	/* namespace tag */
-	unsigned int		hash;	/* ns + name hash */
+	const void		*ns;		/* namespace tag */
+	enum kobj_ns_type	ns_type;	/* type of namespace tag */
+	unsigned int		hash;		/* ns + name hash */
 	union {
 		struct kernfs_elem_dir		dir;
 		struct kernfs_elem_symlink	symlink;
@@ -275,7 +277,7 @@ struct kernfs_ops {
  */
 struct kernfs_fs_context {
 	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
-	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
+	void			*ns_tag[KOBJ_NS_TYPES]; /* Namespace tags of the mount (or empty) */
 	unsigned long		magic;		/* File system specific magic number */
 
 	/* The following are set/used by kernfs_mount() */
@@ -319,17 +321,20 @@ static inline ino_t kernfs_gen(struct kernfs_node *kn)
 
 /**
  * kernfs_enable_ns - enable namespace under a directory
- * @kn: directory of interest, should be empty
+ * @kn:		directory of interest, should be empty
+ * @ns_type:	type of namespace that should be enabled for this directory
  *
  * This is to be called right after @kn is created to enable namespace
  * under it.  All children of @kn must have non-NULL namespace tags and
  * only the ones which match the super_block's tag will be visible.
  */
-static inline void kernfs_enable_ns(struct kernfs_node *kn)
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type)
 {
 	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
 	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
 	kn->flags |= KERNFS_NS;
+	kn->ns_type = ns_type;
 }
 
 /**
@@ -401,7 +406,7 @@ int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 		     const void *value, size_t size, int flags);
 
-const void *kernfs_super_ns(struct super_block *sb);
+const void **kernfs_super_ns(struct super_block *sb);
 int kernfs_get_tree(struct fs_context *fc);
 void kernfs_free_fs_context(struct fs_context *fc);
 void kernfs_kill_sb(struct super_block *sb);
@@ -415,7 +420,8 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
 { return 0; }	/* whatever */
 
-static inline void kernfs_enable_ns(struct kernfs_node *kn) { }
+static inline void kernfs_enable_ns(struct kernfs_node *kn,
+				    enum kobj_ns_type ns_type) { }
 
 static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 { return false; }
@@ -511,7 +517,7 @@ static inline int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 				   const void *value, size_t size, int flags)
 { return -ENOSYS; }
 
-static inline const void *kernfs_super_ns(struct super_block *sb)
+static inline const void **kernfs_super_ns(struct super_block *sb)
 { return NULL; }
 
 static inline int kernfs_get_tree(struct fs_context *fc)
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 80bb865b3a33..d127b3487abc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -306,9 +306,10 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
 
 int __must_check sysfs_init(void);
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
-	return kernfs_enable_ns(kn);
+	return kernfs_enable_ns(kn, ns_type);
 }
 
 int sysfs_file_change_owner(struct kobject *kobj, const char *name, kuid_t kuid,
@@ -531,7 +532,8 @@ static inline int __must_check sysfs_init(void)
 	return 0;
 }
 
-static inline void sysfs_enable_ns(struct kernfs_node *kn)
+static inline void sysfs_enable_ns(struct kernfs_node *kn,
+				   enum kobj_ns_type ns_type)
 {
 }
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 6f07083cc111..c58c62d49a10 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -120,7 +120,7 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(ops->type >= KOBJ_NS_TYPES);
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
-		sysfs_enable_ns(kobj->sd);
+		sysfs_enable_ns(kobj->sd, ops->type);
 	}
 
 	return 0;
-- 
2.26.2



* [PATCH v3 5/7] loop: preserve sysfs backwards compatibility
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
                   ` (3 preceding siblings ...)
  2020-04-24 16:20 ` [PATCH v3 4/7] kernfs: handle multiple namespace tags Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 7/7] loopfs: only show devices in their correct instance Christian Brauner
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

For sysfs the initial namespace is special. All devices currently
propagate into all non-initial namespaces. For example, sysfs is usually
mounted in a privileged or unprivileged container and all devices are
visible to the container but are owned by global root. Even though none
of the propagated files can be used, there are still a lot of read-only
values that are accessed or read by tools running in non-initial
namespaces. Some devices though, which can be moved or created in
another namespace, will only show up in the corresponding namespace.
This currently includes network and loop devices but no other ones.
Since all current workloads depend on devices from the initial namespace
being visible, this behavior cannot simply be changed. This patch just
makes sure to keep propagating devices that share the same device class
with loop devices from the initial namespace into all non-initial
namespaces as before. In short, nothing changes: only loopfs loop
devices will be shown in their correct namespace.
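
To illustrate the lookup rule this patch adds to kernfs, here is a small
userspace model. It is only a sketch: the model_* names and the flag
values are made up for illustration and are not kernel API. It assumes
an entry is visible in a mount when its tag equals the mount's tag, or
when the parent directory has propagation enabled and the entry is
tagged with the initial namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for KERNFS_NS and KERNFS_NS_PROPAGATE. */
enum { MODEL_NS = 0x1, MODEL_NS_PROPAGATE = 0x2 };

struct model_dir {
	unsigned int flags;	/* MODEL_NS and/or MODEL_NS_PROPAGATE */
	const void *init_ns;	/* tag of the initial namespace for this dir */
};

/* Same test as kernfs_init_ns_propagates(): both flags must be set. */
static bool model_init_ns_propagates(const struct model_dir *dir)
{
	assert(dir != NULL);
	return (dir->flags & (MODEL_NS | MODEL_NS_PROPAGATE)) ==
	       (MODEL_NS | MODEL_NS_PROPAGATE);
}

/*
 * An entry is shown if its tag matches the mount's tag, or if the
 * directory propagates and the entry carries the initial namespace tag.
 */
static bool model_entry_visible(const struct model_dir *dir,
				const void *entry_ns, const void *mount_ns)
{
	const void *init_ns;

	init_ns = model_init_ns_propagates(dir) ? dir->init_ns : NULL;

	if (!(dir->flags & MODEL_NS))
		return true;	/* untagged directory: no filtering (simplified) */

	return entry_ns == mount_ns || (init_ns && entry_ns == init_ns);
}
```

With this model, a directory that only sets MODEL_NS behaves like the
network class does today (strict matching), while one that also sets
MODEL_NS_PROPAGATE behaves like the block class after this patch.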

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Protect init_net with a CONFIG_NET ifdef in case it is set to "n".
  - As Tejun pointed out, there is an argument to be made that a new
    mount option for sysfs could be added that would change how devices
    are propagated. This patch does not prevent this but it is an
    orthogonal problem.

/* v3 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Re-add the missing callback that accidentally got removed in the
    last version.
---
 block/genhd.c               | 85 +++++++++++++++++++++++++++++++++++++
 fs/kernfs/dir.c             | 34 ++++++++++++---
 fs/kernfs/kernfs-internal.h | 24 +++++++++++
 fs/sysfs/mount.c            |  4 ++
 include/linux/genhd.h       |  3 ++
 include/linux/kernfs.h      | 22 ++++++++++
 include/linux/kobject_ns.h  |  4 ++
 lib/kobject.c               |  2 +
 8 files changed, 173 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 06b642b23a07..c79a7c97f458 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1198,11 +1198,87 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
 	return NULL;
 }
 
+#ifdef CONFIG_BLK_DEV_LOOPFS
+static void *user_grab_current_ns(void)
+{
+	struct user_namespace *ns = current_user_ns();
+	return get_user_ns(ns);
+}
+
+static const void *user_initial_ns(void)
+{
+	return &init_user_ns;
+}
+
+static void user_put_ns(void *p)
+{
+	struct user_namespace *ns = p;
+	put_user_ns(ns);
+}
+
+static bool user_current_may_mount(void)
+{
+	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
+}
+
+static bool user_initial_ns_propagates(void)
+{
+	return true;
+}
+
+const struct kobj_ns_type_operations user_ns_type_operations = {
+	.type			= KOBJ_NS_TYPE_USER,
+	.current_may_mount	= user_current_may_mount,
+	.grab_current_ns	= user_grab_current_ns,
+	.initial_ns		= user_initial_ns,
+	.drop_ns		= user_put_ns,
+	.initial_ns_propagates	= user_initial_ns_propagates,
+};
+
+static const void *block_class_user_namespace(struct device *dev)
+{
+	struct gendisk *disk;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	return disk->user_ns;
+}
+
+static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
+{
+	struct gendisk *disk;
+	struct user_namespace *ns;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	ns = disk->user_ns;
+	if (ns && ns != &init_user_ns) {
+		kuid_t ns_root_uid = make_kuid(ns, 0);
+		kgid_t ns_root_gid = make_kgid(ns, 0);
+
+		if (uid_valid(ns_root_uid))
+			*uid = ns_root_uid;
+
+		if (gid_valid(ns_root_gid))
+			*gid = ns_root_gid;
+	}
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int __init genhd_device_init(void)
 {
 	int error;
 
 	block_class.dev_kobj = sysfs_dev_block_kobj;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	kobj_ns_type_register(&user_ns_type_operations);
+#endif
 	error = class_register(&block_class);
 	if (unlikely(error))
 		return error;
@@ -1524,8 +1600,14 @@ static void disk_release(struct device *dev)
 		blk_put_queue(disk->queue);
 	kfree(disk);
 }
+
 struct class block_class = {
 	.name		= "block",
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	.ns_type	= &user_ns_type_operations,
+	.namespace	= block_class_user_namespace,
+	.get_ownership	= block_class_get_ownership,
+#endif
 };
 
 static char *block_devnode(struct device *dev, umode_t *mode,
@@ -1715,6 +1797,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 		disk_to_dev(disk)->class = &block_class;
 		disk_to_dev(disk)->type = &disk_type;
 		device_initialize(disk_to_dev(disk));
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		disk->user_ns = &init_user_ns;
+#endif
 	}
 	return disk;
 }
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1f2d894ae454..02796ba6521a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -575,10 +575,15 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
-		goto out_bad;
+	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
+		if (kernfs_init_ns_propagates(kn->parent) &&
+		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
+			goto out_good;
+		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
+			goto out_bad;
+	}
 
+out_good:
 	mutex_unlock(&kernfs_mutex);
 	return 1;
 out_bad:
@@ -1090,6 +1095,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	if (!kn && kernfs_init_ns_propagates(parent)) {
+		ns = kernfs_init_ns(parent->ns_type);
+		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	}
 
 	/* no such entry */
 	if (!kn || !kernfs_active(kn)) {
@@ -1614,6 +1623,8 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+	const void *init_ns;
+
 	if (pos) {
 		int valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
@@ -1621,6 +1632,12 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		if (!valid)
 			pos = NULL;
 	}
+
+	if (kernfs_init_ns_propagates(parent))
+		init_ns = kernfs_init_ns(parent->ns_type);
+	else
+		init_ns = NULL;
+
 	if (!pos && (hash > 1) && (hash < INT_MAX)) {
 		struct rb_node *node = parent->dir.children.rb_node;
 		while (node) {
@@ -1635,7 +1652,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1650,13 +1667,20 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 {
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
 	if (pos) {
+		const void *init_ns;
+		if (kernfs_init_ns_propagates(parent))
+			init_ns = kernfs_init_ns(parent->ns_type);
+		else
+			init_ns = NULL;
+
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
 			if (!node)
 				pos = NULL;
 			else
 				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
+		} while (pos && (!kernfs_active(pos) ||
+				 (pos->ns != ns && pos->ns != init_ns)));
 	}
 	return pos;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7c972c00f84a..74eb6c447361 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -80,6 +80,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 	return d_inode(dentry)->i_private;
 }
 
+#ifdef CONFIG_NET
+extern struct net init_net;
+#endif
+
+extern struct user_namespace init_user_ns;
+
+static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
+{
+	switch (ns_type) {
+	case KOBJ_NS_TYPE_NET:
+#ifdef CONFIG_NET
+		return &init_net;
+#else
+		break;
+#endif
+	case KOBJ_NS_TYPE_USER:
+		return &init_user_ns;
+	default:
+		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
+	}
+
+	return NULL;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 711844a97b4c..dc02fde0e60c 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 
 	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
+		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -84,6 +87,7 @@ static void sysfs_kill_sb(struct super_block *sb)
 	void **ns = (void **)kernfs_super_ns(sb);
 
 	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
+	kobj_ns_drop(KOBJ_NS_TYPE_USER, ns[KOBJ_NS_TYPE_USER]);
 	kernfs_kill_sb(sb);
 }
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 9b3fffdf4011..4edf8a82e815 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -220,6 +220,9 @@ struct gendisk {
 	int node_id;
 	struct badblocks *bb;
 	struct lockdep_map lockdep_map;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct user_namespace *user_ns;
+#endif
 };
 
 static inline struct gendisk *part_to_disk(struct hd_struct *part)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d0544f2e0c99..7809584a7c8e 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -53,6 +53,7 @@ enum kernfs_node_flag {
 	KERNFS_SUICIDED		= 0x0800,
 	KERNFS_EMPTY_DIR	= 0x1000,
 	KERNFS_HAS_RELEASE	= 0x2000,
+	KERNFS_NS_PROPAGATE	= 0x4000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -337,6 +338,27 @@ static inline void kernfs_enable_ns(struct kernfs_node *kn,
 	kn->ns_type = ns_type;
 }
 
+static inline void kernfs_enable_init_ns_propagates(struct kernfs_node *kn)
+{
+	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
+	WARN_ON_ONCE(!(kn->flags & KERNFS_NS));
+	kn->flags |= KERNFS_NS_PROPAGATE;
+}
+
+/**
+ * kernfs_init_ns_propagates - test whether init ns propagates
+ * @kn: the node to test
+ *
+ * Test whether kernfs entries created in the init namespace propagate into
+ * other namespaces.
+ */
+static inline bool kernfs_init_ns_propagates(const struct kernfs_node *kn)
+{
+	return ((kn->flags & (KERNFS_NS | KERNFS_NS_PROPAGATE)) ==
+		(KERNFS_NS | KERNFS_NS_PROPAGATE));
+}
+
 /**
  * kernfs_ns_enabled - test whether namespace is enabled
  * @kn: the node to test
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 991a9286bcea..a9c45bcce235 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -26,6 +26,7 @@ struct kobject;
 enum kobj_ns_type {
 	KOBJ_NS_TYPE_NONE = 0,
 	KOBJ_NS_TYPE_NET,
+	KOBJ_NS_TYPE_USER,
 	KOBJ_NS_TYPES
 };
 
@@ -34,6 +35,8 @@ enum kobj_ns_type {
  *   @grab_current_ns: return a new reference to calling task's namespace
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
+ *   @initial_ns_propagates: whether devices in the initial namespace propagate
+ *			to all other namespaces
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
@@ -41,6 +44,7 @@ struct kobj_ns_type_operations {
 	void *(*grab_current_ns)(void);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
+	bool (*initial_ns_propagates)(void);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
diff --git a/lib/kobject.c b/lib/kobject.c
index c58c62d49a10..96bb8c732d1c 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -121,6 +121,8 @@ static int create_dir(struct kobject *kobj)
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
 		sysfs_enable_ns(kobj->sd, ops->type);
+		if (ops->initial_ns_propagates && ops->initial_ns_propagates())
+			kernfs_enable_init_ns_propagates(kobj->sd);
 	}
 
 	return 0;
-- 
2.26.2



* [PATCH v3 6/7] loopfs: start attaching correct namespace during loop_add()
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
                   ` (4 preceding siblings ...)
  2020-04-24 16:20 ` [PATCH v3 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  2020-04-24 16:20 ` [PATCH v3 7/7] loopfs: only show devices in their correct instance Christian Brauner
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Tag loop devices with the namespace the loopfs instance was mounted in.
As a consequence, loopfs devices carry the correct sysfs permissions for
all their core files. All other device files will continue to be
correctly owned by the initial namespace. Here is sample output:

root@b1:~# mount -t loop loop /mnt
root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
root@b1:~# losetup -f
/dev/loop8
root@b1:~# ln -sf /mnt/loop8 /dev/loop8
root@b1:~# ls -al /sys/class/block/loop8
lrwxrwxrwx 1 root root 0 Apr  7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
root@b1:~# ls -al /sys/class/block/loop8/
total 0
drwxr-xr-x  9 root   root       0 Apr  7 13:06 .
drwxr-xr-x 18 nobody nogroup    0 Apr  7 13:07 ..
-r--r--r--  1 root   root    4096 Apr  7 13:06 alignment_offset
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 bdi -> ../../bdi/7:8
-r--r--r--  1 root   root    4096 Apr  7 13:06 capability
-r--r--r--  1 root   root    4096 Apr  7 13:06 dev
-r--r--r--  1 root   root    4096 Apr  7 13:06 discard_alignment
-r--r--r--  1 root   root    4096 Apr  7 13:06 events
-r--r--r--  1 root   root    4096 Apr  7 13:06 events_async
-rw-r--r--  1 root   root    4096 Apr  7 13:06 events_poll_msecs
-r--r--r--  1 root   root    4096 Apr  7 13:06 ext_range
-r--r--r--  1 root   root    4096 Apr  7 13:06 hidden
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 holders
-r--r--r--  1 root   root    4096 Apr  7 13:06 inflight
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 integrity
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 mq
drwxr-xr-x  2 root   root       0 Apr  7 13:06 power
drwxr-xr-x  3 nobody nogroup    0 Apr  7 13:07 queue
-r--r--r--  1 root   root    4096 Apr  7 13:06 range
-r--r--r--  1 root   root    4096 Apr  7 13:06 removable
-r--r--r--  1 root   root    4096 Apr  7 13:06 ro
-r--r--r--  1 root   root    4096 Apr  7 13:06 size
drwxr-xr-x  2 nobody nogroup    0 Apr  7 13:07 slaves
-r--r--r--  1 root   root    4096 Apr  7 13:06 stat
lrwxrwxrwx  1 nobody nogroup    0 Apr  7 13:07 subsystem -> ../../../../class/block
drwxr-xr-x  2 root   root       0 Apr  7 13:06 trace
-rw-r--r--  1 root   root    4096 Apr  7 13:06 uevent
root@b1:~#
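
In the listing above, "root" is the container's root and "nobody" is how
global root appears when it has no mapping in the container. The
ownership choice behind this can be modelled in userspace roughly as
follows. This is a sketch under assumptions: the model_* names and the
single-range id map are invented for illustration, whereas the kernel
uses make_kuid()/make_kgid() with the namespace's real id maps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX

/* Hypothetical single-extent id map: [ns_first, ns_first + count). */
struct model_userns {
	uint32_t ns_first;	/* first id inside the namespace */
	uint32_t host_first;	/* host id that ns_first maps to */
	uint32_t count;		/* number of mapped ids */
};

/* Rough analogue of make_kuid(): translate a namespace id to a host id. */
static uint32_t model_make_kid(const struct model_userns *ns, uint32_t id)
{
	assert(ns != NULL);
	if (id < ns->ns_first || id - ns->ns_first >= ns->count)
		return MODEL_INVALID_ID;
	return ns->host_first + (id - ns->ns_first);
}

/*
 * Pick the owner of a disk's sysfs files: keep the default unless the
 * disk belongs to a non-initial namespace whose root id is mapped.
 */
static uint32_t model_sysfs_owner(const struct model_userns *disk_ns,
				  const struct model_userns *init_ns,
				  uint32_t default_owner)
{
	uint32_t ns_root;

	if (!disk_ns || disk_ns == init_ns)
		return default_owner;

	ns_root = model_make_kid(disk_ns, 0);
	return ns_root != MODEL_INVALID_ID ? ns_root : default_owner;
}
```

For a container whose id 0 maps to host id 100000, the model returns
100000 for a loopfs disk created in that container and leaves every
other disk at the default owner.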

Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Adapted commit message, otherwise unchanged.

/* v3 */
unchanged
---
 drivers/block/loop.c          | 3 +++
 drivers/block/loopfs/loopfs.c | 6 ++++++
 drivers/block/loopfs/loopfs.h | 1 +
 3 files changed, 10 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 40705f5aeabd..a5fe05cba896 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2207,6 +2207,9 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->private_data	= lo;
 	disk->queue		= lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	loopfs_init(disk, inode);
+#endif
 
 	add_disk(disk);
 
diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
index 09cd5a919ea2..9fa60c1bcc05 100644
--- a/drivers/block/loopfs/loopfs.c
+++ b/drivers/block/loopfs/loopfs.c
@@ -74,6 +74,12 @@ bool loopfs_wants_remove(const struct loop_device *lo)
 	       (lo->lo_info->lo_flags & LOOPFS_FLAGS_INACTIVE);
 }
 
+void loopfs_init(struct gendisk *disk, struct inode *inode)
+{
+	if (loopfs_i_sb(inode))
+		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+}
+
 /**
  * loopfs_add - allocate inode from super block of a loopfs mount
  * @lo:		loop device for which we are creating a new device entry
diff --git a/drivers/block/loopfs/loopfs.h b/drivers/block/loopfs/loopfs.h
index 225d844f5a01..7ca1b872b36e 100644
--- a/drivers/block/loopfs/loopfs.h
+++ b/drivers/block/loopfs/loopfs.h
@@ -31,6 +31,7 @@ extern void loopfs_remove(struct loop_device *lo);
 extern bool loopfs_wants_remove(const struct loop_device *lo);
 extern void loopfs_evict_locked(struct loop_device *lo);
 extern int loopfs_rundown_locked(struct loop_device *lo);
+extern void loopfs_init(struct gendisk *disk, struct inode *inode);
 
 #endif
 
-- 
2.26.2



* [PATCH v3 7/7] loopfs: only show devices in their correct instance
  2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
                   ` (5 preceding siblings ...)
  2020-04-24 16:20 ` [PATCH v3 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
@ 2020-04-24 16:20 ` Christian Brauner
  6 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-04-24 16:20 UTC (permalink / raw)
  To: Jens Axboe, Greg Kroah-Hartman, linux-kernel, linux-block, linux-api
  Cc: Jonathan Corbet, Serge Hallyn, Rafael J. Wysocki, Tejun Heo,
	David S. Miller, Christian Brauner, Saravana Kannan, Jan Kara,
	David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
	Christian Kellner, Dmitry Vyukov, Stéphane Graber,
	linux-doc, netdev, Steve Barber, Dylan Reid,
	Filipe Brandenburger, Kees Cook, Benjamin Elder, Akihiro Suda

Since loopfs devices belong to a loopfs instance they have no business
polluting the host's devtmpfs mount and should not propagate out of the
namespace they belong to.
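
The mechanism can be pictured with the following userspace sketch. The
model_* names are hypothetical and only mirror the shape of the change:
a per-device flag that partitions inherit from their disk and that makes
the devtmpfs create/delete paths return early.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of struct device to the fields that matter here. */
struct model_device {
	const struct model_device *parent;
	bool no_devnode;	/* keep this device's node out of devtmpfs */
};

/* Partitions copy the flag from their disk, as add_partition() does. */
static void model_add_partition(struct model_device *part,
				const struct model_device *disk)
{
	assert(part != NULL && disk != NULL);
	part->parent = disk;
	part->no_devnode = disk->no_devnode;
}

/*
 * Whether a devtmpfs request would be queued: nothing happens if the
 * devtmpfs thread is not running or the device opted out.
 */
static bool model_devtmpfs_wants_node(const struct model_device *dev,
				      bool thread_running)
{
	assert(dev != NULL);
	return thread_running && !dev->no_devnode;
}
```

In this model a loopfs disk sets no_devnode once, and every partition
added afterwards is skipped by devtmpfs as well.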

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 block/partitions/core.c       | 1 +
 drivers/base/devtmpfs.c       | 4 ++--
 drivers/block/loopfs/loopfs.c | 4 +++-
 include/linux/device.h        | 3 +++
 4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/partitions/core.c b/block/partitions/core.c
index bc1ded1331b1..5761f5c38588 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -416,6 +416,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	pdev->class = &block_class;
 	pdev->type = &part_type;
 	pdev->parent = ddev;
+	pdev->no_devnode = ddev->no_devnode;
 
 	err = blk_alloc_devt(p, &devt);
 	if (err)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..77371ceb88fa 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -111,7 +111,7 @@ int devtmpfs_create_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.mode = 0;
@@ -138,7 +138,7 @@ int devtmpfs_delete_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;
 
-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;
 
 	req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
diff --git a/drivers/block/loopfs/loopfs.c b/drivers/block/loopfs/loopfs.c
index 9fa60c1bcc05..1bcb0b44c910 100644
--- a/drivers/block/loopfs/loopfs.c
+++ b/drivers/block/loopfs/loopfs.c
@@ -76,8 +76,10 @@ bool loopfs_wants_remove(const struct loop_device *lo)
 
 void loopfs_init(struct gendisk *disk, struct inode *inode)
 {
-	if (loopfs_i_sb(inode))
+	if (loopfs_i_sb(inode)) {
 		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+		disk_to_dev(disk)->no_devnode = true;
+	}
 }
 
 /**
diff --git a/include/linux/device.h b/include/linux/device.h
index ac8e37cd716a..c69ef1c5a0ef 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -523,6 +523,8 @@ struct dev_links_info {
  *		  sync_state() callback.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @no_devnode: whether device nodes associated with this device are kept out
+ *		of devtmpfs (e.g. due to separate filesystem)
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -622,6 +624,7 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool			dma_coherent:1;
 #endif
+	bool			no_devnode:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
-- 
2.26.2



end of thread, other threads:[~2020-04-24 16:22 UTC | newest]

Thread overview: 8+ messages
2020-04-24 16:20 [PATCH v3 0/7] loopfs Christian Brauner
2020-04-24 16:20 ` [PATCH v3 1/7] kobject_uevent: remove unneeded netlink_ns check Christian Brauner
2020-04-24 16:20 ` [PATCH v3 2/7] loopfs: implement loopfs Christian Brauner
2020-04-24 16:20 ` [PATCH v3 3/7] loop: use ns_capable for some loop operations Christian Brauner
2020-04-24 16:20 ` [PATCH v3 4/7] kernfs: handle multiple namespace tags Christian Brauner
2020-04-24 16:20 ` [PATCH v3 5/7] loop: preserve sysfs backwards compatibility Christian Brauner
2020-04-24 16:20 ` [PATCH v3 6/7] loopfs: start attaching correct namespace during loop_add() Christian Brauner
2020-04-24 16:20 ` [PATCH v3 7/7] loopfs: only show devices in their correct instance Christian Brauner
