linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] proc: "mount -o lookup=" support
@ 2022-01-19 15:48 Alexey Dobriyan
  2022-01-19 15:57 ` Alexey Dobriyan
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-01-19 15:48 UTC (permalink / raw)
  To: viro, ebiederm, akpm
  Cc: linux-kernel, linux-fsdevel, stephen.s.brennan, legion

From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Mon, 22 Nov 2021 20:41:06 +0300
Subject: [PATCH 1/1] proc: "mount -o lookup=..." support

Docker implements MaskedPaths configuration option

	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97

to disable certain /proc files. It overmounts them with /dev/null.

Implement proper mount option which selectively disables lookup/readdir
in the top level /proc directory so that MaskedPaths doesn't need
to be updated as time goes on.

Syntax is

			Filter everything
	# mount -t proc -o lookup=/ proc /proc
	# ls /proc
	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163

			Allow /proc/cpuinfo and /proc/uptime
	# mount -t proc proc -o lookup=cpuinfo/uptime /proc

	# ls /proc
				...
	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
	-r--r--r--   1 root       root          0 Nov 22 21:12 cpuinfo
	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163
	-r--r--r--   1 root       root          0 Nov 22 21:12 uptime

Trailing slash is optional but saves 1 allocation.
Trailing slash is mandatory for "filter everything".

Remounting with lookup= is disabled so that files and dcache entries
don't stay active while filter list is changed. Users are supposed
to unmount and mount again with different lookup= set.
Remount rules may change in the future. (Eric W. Biederman)

Re: speed
This is the price for filtering, given that lookup= is whitelist it is
not supposed to be very long. Second, it is one linear memory scan per
lookup, there are no linked lists. It may be faster than rbtree in fact.
It consumes 1 allocation per superblock which is list of names itself.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

	v2
	documentation!
	descriptive comments!
	disable remount

 Documentation/filesystems/proc.rst |   8 ++
 fs/proc/generic.c                  |  18 ++--
 fs/proc/internal.h                 |  31 ++++++-
 fs/proc/proc_net.c                 |   2 +-
 fs/proc/root.c                     | 127 ++++++++++++++++++++++++++++-
 include/linux/proc_fs.h            |   2 +
 6 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8d7f141c6fc7..9a328f0b4346 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2186,6 +2186,7 @@ The following mount options are supported:
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
 	subset=		Show only the specified subset of procfs.
+        lookup=         Top-level /proc filter, independent of subset=
 	=========	========================================================
 
 hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2218,6 +2219,13 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+lookup= mount option makes available only listed files/directories in
+the top-level /proc directory. Individual names are separated
+by slash. Empty list is equivalent to subset=pid. lookup= filters before
+subset= if both options are supplied. lookup= doesn't affect /proc/${pid}
+directories availability as well as /proc/self and /proc/thread-self
+symlinks. More fine-grained filtering is not supported at the moment.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 5b78739e60e4..4d04f8d89cdc 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -282,7 +282,7 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
  * for success..
  */
 int proc_readdir_de(struct file *file, struct dir_context *ctx,
-		    struct proc_dir_entry *de)
+		    struct proc_dir_entry *de, const struct proc_lookup_list *ll)
 {
 	int i;
 
@@ -307,12 +307,15 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
 		struct proc_dir_entry *next;
 		pde_get(de);
 		read_unlock(&proc_subdir_lock);
-		if (!dir_emit(ctx, de->name, de->namelen,
-			    de->low_ino, de->mode >> 12)) {
-			pde_put(de);
-			return 0;
+
+		if (in_lookup_list(ll, de->name, de->namelen)) {
+			if (!dir_emit(ctx, de->name, de->namelen, de->low_ino, de->mode >> 12)) {
+				pde_put(de);
+				return 0;
+			}
+			ctx->pos++;
 		}
-		ctx->pos++;
+
 		read_lock(&proc_subdir_lock);
 		next = pde_subdir_next(de);
 		pde_put(de);
@@ -330,7 +333,8 @@ int proc_readdir(struct file *file, struct dir_context *ctx)
 	if (fs_info->pidonly == PROC_PIDONLY_ON)
 		return 1;
 
-	return proc_readdir_de(file, ctx, PDE(inode));
+	return proc_readdir_de(file, ctx, PDE(inode),
+				PDE(inode) == &proc_root ? fs_info->lookup_list : NULL);
 }
 
 /*
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 03415f3fb3a8..e74acb437c56 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -190,7 +190,7 @@ struct proc_dir_entry *proc_register(struct proc_dir_entry *dir,
 extern struct dentry *proc_lookup(struct inode *, struct dentry *, unsigned int);
 struct dentry *proc_lookup_de(struct inode *, struct dentry *, struct proc_dir_entry *);
 extern int proc_readdir(struct file *, struct dir_context *);
-int proc_readdir_de(struct file *, struct dir_context *, struct proc_dir_entry *);
+int proc_readdir_de(struct file *, struct dir_context *, struct proc_dir_entry *, const struct proc_lookup_list *);
 
 static inline void pde_get(struct proc_dir_entry *pde)
 {
@@ -318,3 +318,32 @@ static inline void pde_force_lookup(struct proc_dir_entry *pde)
 	/* /proc/net/ entries can be changed under us by setns(CLONE_NEWNET) */
 	pde->proc_dops = &proc_net_dentry_ops;
 }
+
+/*
+ * Pascal strings stiched together making filtering memory access pattern linear.
+ *
+ * "mount -t proc -o lookup=/" results in
+ *
+ *	(u8[]){
+ *		0
+ *	}
+ *
+ * "mount -t proc -o lookup=cpuinfo/uptime/" results in
+ *
+ *	(u8[]){
+ *		7, 'c', 'p', 'u', 'i', 'n', 'f', 'o',
+ *		6, 'u', 'p', 't', 'i', 'm', 'e',
+ *		0
+ *	}
+ */
+struct proc_lookup_list {
+	u8 len;
+	char str[];
+};
+
+static inline struct proc_lookup_list *lookup_list_next(const struct proc_lookup_list *ll)
+{
+	return (struct proc_lookup_list *)((void *)ll + 1 + ll->len);
+}
+
+bool in_lookup_list(const struct proc_lookup_list *ll, const char *str, unsigned int len);
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 15c2e55d2ed2..7941df2d3d74 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -321,7 +321,7 @@ static int proc_tgid_net_readdir(struct file *file, struct dir_context *ctx)
 	ret = -EINVAL;
 	net = get_proc_task_net(file_inode(file));
 	if (net != NULL) {
-		ret = proc_readdir_de(file, ctx, net->proc_net);
+		ret = proc_readdir_de(file, ctx, net->proc_net, NULL);
 		put_net(net);
 	}
 	return ret;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index c7e3b1350ef8..8000558d7d2c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -35,18 +35,22 @@ struct proc_fs_context {
 	enum proc_hidepid	hidepid;
 	int			gid;
 	enum proc_pidonly	pidonly;
+	struct proc_lookup_list	*lookup_list;
+	unsigned int		lookup_list_len;
 };
 
 enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+	Opt_lookup,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
 	fsparam_u32("gid",	Opt_gid),
 	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
+	fsparam_string("lookup",	Opt_lookup),
 	{}
 };
 
@@ -112,6 +116,65 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
 	return 0;
 }
 
+static int proc_parse_lookup_param(struct fs_context *fc, char *str0)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct proc_lookup_list *ll;
+	char *str;
+	const char *slash;
+	const char *src;
+	unsigned int len;
+	int rv;
+
+	/* Force trailing slash, simplify loops below. */
+	len = strlen(str0);
+	if (len > 0 && str0[len - 1] == '/') {
+		str = str0;
+	} else {
+		str = kmalloc(len + 2, GFP_KERNEL);
+		if (!str) {
+			rv = -ENOMEM;
+			goto out;
+		}
+		memcpy(str, str0, len);
+		str[len] = '/';
+		str[len + 1] = '\0';
+	}
+
+	len = 0;
+	for (src = str; (slash = strchr(src, '/')); src = slash + 1) {
+		if (slash - src >= 256) {
+			rv = -EINVAL;
+			goto out_free_str;
+		}
+		len += 1 + (slash - src);
+	}
+	len += 1;
+
+	ctx->lookup_list = ll = kmalloc(len, GFP_KERNEL);
+	ctx->lookup_list_len = len;
+	if (!ll) {
+		rv = -ENOMEM;
+		goto out_free_str;
+	}
+
+	for (src = str; (slash = strchr(src, '/')); src = slash + 1) {
+		ll->len = slash - src;
+		memcpy(ll->str, src, ll->len);
+		ll = lookup_list_next(ll);
+	}
+	ll->len = 0;
+
+	rv = 0;
+
+out_free_str:
+	if (str != str0) {
+		kfree(str);
+	}
+out:
+	return rv;
+}
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -137,6 +200,11 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 			return -EINVAL;
 		break;
 
+	case Opt_lookup:
+		if (proc_parse_lookup_param(fc, param->string) < 0)
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -157,6 +225,10 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_lookup)) {
+		fs_info->lookup_list = ctx->lookup_list;
+		ctx->lookup_list = NULL;
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -218,6 +290,14 @@ static int proc_reconfigure(struct fs_context *fc)
 	struct super_block *sb = fc->root->d_sb;
 	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
+	/*
+	 * "Hide everything" lookup filter is not a problem as only
+	 * /proc/${pid}, /proc/self and /proc/thread-self are accessible.
+	 */
+	if (fs_info->lookup_list && fs_info->lookup_list->len > 0) {
+		return invalfc(fc, "'-o remount,lookup=' is unsupported, unmount and mount instead");
+	}
+
 	sync_filesystem(sb);
 
 	proc_apply_options(fs_info, fc, current_user_ns());
@@ -234,11 +314,34 @@ static void proc_fs_context_free(struct fs_context *fc)
 	struct proc_fs_context *ctx = fc->fs_private;
 
 	put_pid_ns(ctx->pid_ns);
+	kfree(ctx->lookup_list);
 	kfree(ctx);
 }
 
+static int proc_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	struct proc_fs_context *src = fc->fs_private;
+	struct proc_fs_context *dst;
+
+	dst = kmemdup(src, sizeof(struct proc_fs_context), GFP_KERNEL);
+	if (!dst) {
+		return -ENOMEM;
+	}
+
+	dst->lookup_list = kmemdup(dst->lookup_list, dst->lookup_list_len, GFP_KERNEL);
+	if (!dst->lookup_list) {
+		kfree(dst);
+		return -ENOMEM;
+	}
+	get_pid_ns(dst->pid_ns);
+
+	fc->fs_private = dst;
+	return 0;
+}
+
 static const struct fs_context_operations proc_fs_context_ops = {
 	.free		= proc_fs_context_free,
+	.dup		= proc_fs_context_dup,
 	.parse_param	= proc_parse_param,
 	.get_tree	= proc_get_tree,
 	.reconfigure	= proc_reconfigure,
@@ -274,6 +377,7 @@ static void proc_kill_sb(struct super_block *sb)
 
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
+	kfree(fs_info->lookup_list);
 	kfree(fs_info);
 }
 
@@ -317,12 +421,33 @@ static int proc_root_getattr(struct user_namespace *mnt_userns,
 	return 0;
 }
 
+bool in_lookup_list(const struct proc_lookup_list *ll, const char *str, unsigned int len)
+{
+	if (ll) {
+		for (; ll->len > 0; ll = lookup_list_next(ll)) {
+			if (ll->len == len && strncmp(ll->str, str, len) == 0) {
+				return true;
+			}
+		}
+		return false;
+	} else {
+		return true;
+	}
+}
+
 static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
 {
+	struct proc_fs_info *proc_sb = proc_sb_info(dir->i_sb);
+
 	if (!proc_pid_lookup(dentry, flags))
 		return NULL;
 
-	return proc_lookup(dir, dentry, flags);
+	if (in_lookup_list(proc_sb->lookup_list, dentry->d_name.name, dentry->d_name.len)) {
+		return proc_lookup(dir, dentry, flags);
+	} else {
+		return NULL;
+	}
+
 }
 
 static int proc_root_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 069c7fd95396..d2c067560bf9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -10,6 +10,7 @@
 #include <linux/fs.h>
 
 struct proc_dir_entry;
+struct proc_lookup_list;
 struct seq_file;
 struct seq_operations;
 
@@ -65,6 +66,7 @@ struct proc_fs_info {
 	kgid_t pid_gid;
 	enum proc_hidepid hide_pid;
 	enum proc_pidonly pidonly;
+	const struct proc_lookup_list *lookup_list;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 15:48 [PATCH v2] proc: "mount -o lookup=" support Alexey Dobriyan
@ 2022-01-19 15:57 ` Alexey Dobriyan
  2022-01-19 16:24 ` Christian Brauner
  2022-01-19 17:04 ` Alexey Gladkov
  2 siblings, 0 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-01-19 15:57 UTC (permalink / raw)
  To: stephen.s.brennan
  Cc: linux-kernel, linux-fsdevel, stephen.s.brennan, legion, viro,
	ebiederm, akpm

On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> 	# mount -t proc -o lookup=/ proc /proc

> +static int proc_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> +	struct proc_fs_context *src = fc->fs_private;
> +	struct proc_fs_context *dst;
> +
> +	dst = kmemdup(src, sizeof(struct proc_fs_context), GFP_KERNEL);
> +	if (!dst) {
> +		return -ENOMEM;
> +	}
> +
> +	dst->lookup_list = kmemdup(dst->lookup_list, dst->lookup_list_len, GFP_KERNEL);
> +	if (!dst->lookup_list) {
> +		kfree(dst);
> +		return -ENOMEM;
> +	}
> +	get_pid_ns(dst->pid_ns);
> +
> +	fc->fs_private = dst;
> +	return 0;
> +}

Stephen, sorry for not replying earlier.

I don't pretend to understand fully what ->dup() is supposed to do.
And the above code was not tested.

In particular

	p->a = kmemdup(p->a, ...)

reads like "MEMORY LEAK" on the first glance but it is not.

Understanding ->dup is the next thing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 15:48 [PATCH v2] proc: "mount -o lookup=" support Alexey Dobriyan
  2022-01-19 15:57 ` Alexey Dobriyan
@ 2022-01-19 16:24 ` Christian Brauner
  2022-01-19 17:15   ` Alexey Gladkov
                     ` (2 more replies)
  2022-01-19 17:04 ` Alexey Gladkov
  2 siblings, 3 replies; 11+ messages in thread
From: Christian Brauner @ 2022-01-19 16:24 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel,
	stephen.s.brennan, legion, cyphar

On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> From: Alexey Dobriyan <adobriyan@gmail.com>
> Date: Mon, 22 Nov 2021 20:41:06 +0300
> Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> 
> Docker implements MaskedPaths configuration option
> 
> 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> 
> to disable certain /proc files. It overmounts them with /dev/null.
> 
> Implement proper mount option which selectively disables lookup/readdir
> in the top level /proc directory so that MaskedPaths doesn't need
> to be updated as time goes on.

I might've missed this when this was sent the last time so maybe it was
clearly explained in an earlier thread: What's the reason this needs to
live in the kernel?

The MaskedPaths entry is optional so runtimes aren't required to block
anything by default and this mostly makes sense for workloads that run
privileged.

In addition MaskedPaths is a generic option which allows to hide any
existing path, not just proc. Even in the very docker-specific defaults
/sys/firmware is covered.

I do see clear value in the subset= and hidepid= options. They are
generally useful independent of opinionated container workloads. I don't
see the same for lookup=.

An alternative I find more sensible is to add a new value for subset=
that hides anything(?) that only global root should have read/write
access too.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 15:48 [PATCH v2] proc: "mount -o lookup=" support Alexey Dobriyan
  2022-01-19 15:57 ` Alexey Dobriyan
  2022-01-19 16:24 ` Christian Brauner
@ 2022-01-19 17:04 ` Alexey Gladkov
  2022-01-20 12:26   ` Alexey Dobriyan
  2 siblings, 1 reply; 11+ messages in thread
From: Alexey Gladkov @ 2022-01-19 17:04 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel, stephen.s.brennan

On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> >From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> From: Alexey Dobriyan <adobriyan@gmail.com>
> Date: Mon, 22 Nov 2021 20:41:06 +0300
> Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> 
> Docker implements MaskedPaths configuration option
> 
> 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> 
> to disable certain /proc files. It overmounts them with /dev/null.
> 
> Implement proper mount option which selectively disables lookup/readdir
> in the top level /proc directory so that MaskedPaths doesn't need
> to be updated as time goes on.
> 
> Syntax is
> 
> 			Filter everything
> 	# mount -t proc -o lookup=/ proc /proc
> 	# ls /proc
> 	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
> 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
> 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163
> 
> 			Allow /proc/cpuinfo and /proc/uptime
> 	# mount -t proc proc -o lookup=cpuinfo/uptime /proc
> 
> 	# ls /proc
> 				...
> 	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
> 	-r--r--r--   1 root       root          0 Nov 22 21:12 cpuinfo
> 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
> 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163
> 	-r--r--r--   1 root       root          0 Nov 22 21:12 uptime
> 
> Trailing slash is optional but saves 1 allocation.
> Trailing slash is mandatory for "filter everything".
> 
> Remounting with lookup= is disabled so that files and dcache entries
> don't stay active while filter list is changed. Users are supposed
> to unmount and mount again with different lookup= set.
> Remount rules may change in the future. (Eric W. Biederman)
> 
> Re: speed
> This is the price for filtering, given that lookup= is whitelist it is
> not supposed to be very long. Second, it is one linear memory scan per
> lookup, there are no linked lists. It may be faster than rbtree in fact.
> It consumes 1 allocation per superblock which is list of names itself.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> ---
> 
> 	v2
> 	documentation!
> 	descriptive comments!
> 	disable remount
> 
>  Documentation/filesystems/proc.rst |   8 ++
>  fs/proc/generic.c                  |  18 ++--
>  fs/proc/internal.h                 |  31 ++++++-
>  fs/proc/proc_net.c                 |   2 +-
>  fs/proc/root.c                     | 127 ++++++++++++++++++++++++++++-
>  include/linux/proc_fs.h            |   2 +
>  6 files changed, 178 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 8d7f141c6fc7..9a328f0b4346 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2186,6 +2186,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +        lookup=         Top-level /proc filter, independent of subset=

Will it be possible to combine lookup= and subset= options when mounting?

>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2218,6 +2219,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +lookup= mount option makes available only listed files/directories in
> +the top-level /proc directory. Individual names are separated
> +by slash. Empty list is equivalent to subset=pid. lookup= filters before
> +subset= if both options are supplied. lookup= doesn't affect /proc/${pid}
> +directories availability as well as /proc/self and /proc/thread-self
> +symlinks. More fine-grained filtering is not supported at the moment.
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 5b78739e60e4..4d04f8d89cdc 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -282,7 +282,7 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
>   * for success..
>   */
>  int proc_readdir_de(struct file *file, struct dir_context *ctx,
> -		    struct proc_dir_entry *de)
> +		    struct proc_dir_entry *de, const struct proc_lookup_list *ll)
>  {
>  	int i;
>  
> @@ -307,12 +307,15 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
>  		struct proc_dir_entry *next;
>  		pde_get(de);
>  		read_unlock(&proc_subdir_lock);
> -		if (!dir_emit(ctx, de->name, de->namelen,
> -			    de->low_ino, de->mode >> 12)) {
> -			pde_put(de);
> -			return 0;
> +
> +		if (in_lookup_list(ll, de->name, de->namelen)) {
> +			if (!dir_emit(ctx, de->name, de->namelen, de->low_ino, de->mode >> 12)) {
> +				pde_put(de);
> +				return 0;
> +			}
> +			ctx->pos++;
>  		}
> -		ctx->pos++;
> +
>  		read_lock(&proc_subdir_lock);
>  		next = pde_subdir_next(de);
>  		pde_put(de);
> @@ -330,7 +333,8 @@ int proc_readdir(struct file *file, struct dir_context *ctx)
>  	if (fs_info->pidonly == PROC_PIDONLY_ON)
>  		return 1;
>  
> -	return proc_readdir_de(file, ctx, PDE(inode));
> +	return proc_readdir_de(file, ctx, PDE(inode),
> +				PDE(inode) == &proc_root ? fs_info->lookup_list : NULL);
>  }
>  
>  /*
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 03415f3fb3a8..e74acb437c56 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -190,7 +190,7 @@ struct proc_dir_entry *proc_register(struct proc_dir_entry *dir,
>  extern struct dentry *proc_lookup(struct inode *, struct dentry *, unsigned int);
>  struct dentry *proc_lookup_de(struct inode *, struct dentry *, struct proc_dir_entry *);
>  extern int proc_readdir(struct file *, struct dir_context *);
> -int proc_readdir_de(struct file *, struct dir_context *, struct proc_dir_entry *);
> +int proc_readdir_de(struct file *, struct dir_context *, struct proc_dir_entry *, const struct proc_lookup_list *);
>  
>  static inline void pde_get(struct proc_dir_entry *pde)
>  {
> @@ -318,3 +318,32 @@ static inline void pde_force_lookup(struct proc_dir_entry *pde)
>  	/* /proc/net/ entries can be changed under us by setns(CLONE_NEWNET) */
>  	pde->proc_dops = &proc_net_dentry_ops;
>  }
> +
> +/*
> + * Pascal strings stiched together making filtering memory access pattern linear.
> + *
> + * "mount -t proc -o lookup=/" results in
> + *
> + *	(u8[]){
> + *		0
> + *	}
> + *
> + * "mount -t proc -o lookup=cpuinfo/uptime/" results in
> + *
> + *	(u8[]){
> + *		7, 'c', 'p', 'u', 'i', 'n', 'f', 'o',
> + *		6, 'u', 'p', 't', 'i', 'm', 'e',
> + *		0
> + *	}
> + */
> +struct proc_lookup_list {
> +	u8 len;
> +	char str[];
> +};
> +
> +static inline struct proc_lookup_list *lookup_list_next(const struct proc_lookup_list *ll)
> +{
> +	return (struct proc_lookup_list *)((void *)ll + 1 + ll->len);
> +}
> +
> +bool in_lookup_list(const struct proc_lookup_list *ll, const char *str, unsigned int len);
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 15c2e55d2ed2..7941df2d3d74 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -321,7 +321,7 @@ static int proc_tgid_net_readdir(struct file *file, struct dir_context *ctx)
>  	ret = -EINVAL;
>  	net = get_proc_task_net(file_inode(file));
>  	if (net != NULL) {
> -		ret = proc_readdir_de(file, ctx, net->proc_net);
> +		ret = proc_readdir_de(file, ctx, net->proc_net, NULL);
>  		put_net(net);
>  	}
>  	return ret;
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index c7e3b1350ef8..8000558d7d2c 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -35,18 +35,22 @@ struct proc_fs_context {
>  	enum proc_hidepid	hidepid;
>  	int			gid;
>  	enum proc_pidonly	pidonly;
> +	struct proc_lookup_list	*lookup_list;
> +	unsigned int		lookup_list_len;
>  };
>  
>  enum proc_param {
>  	Opt_gid,
>  	Opt_hidepid,
>  	Opt_subset,
> +	Opt_lookup,
>  };
>  
>  static const struct fs_parameter_spec proc_fs_parameters[] = {
>  	fsparam_u32("gid",	Opt_gid),
>  	fsparam_string("hidepid",	Opt_hidepid),
>  	fsparam_string("subset",	Opt_subset),
> +	fsparam_string("lookup",	Opt_lookup),
>  	{}
>  };
>  
> @@ -112,6 +116,65 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
>  	return 0;
>  }
>  
> +static int proc_parse_lookup_param(struct fs_context *fc, char *str0)
> +{
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct proc_lookup_list *ll;
> +	char *str;
> +	const char *slash;
> +	const char *src;
> +	unsigned int len;
> +	int rv;
> +
> +	/* Force trailing slash, simplify loops below. */
> +	len = strlen(str0);
> +	if (len > 0 && str0[len - 1] == '/') {
> +		str = str0;
> +	} else {
> +		str = kmalloc(len + 2, GFP_KERNEL);
> +		if (!str) {
> +			rv = -ENOMEM;
> +			goto out;
> +		}
> +		memcpy(str, str0, len);
> +		str[len] = '/';
> +		str[len + 1] = '\0';
> +	}
> +
> +	len = 0;
> +	for (src = str; (slash = strchr(src, '/')); src = slash + 1) {
> +		if (slash - src >= 256) {
> +			rv = -EINVAL;
> +			goto out_free_str;
> +		}
> +		len += 1 + (slash - src);
> +	}
> +	len += 1;
> +
> +	ctx->lookup_list = ll = kmalloc(len, GFP_KERNEL);
> +	ctx->lookup_list_len = len;
> +	if (!ll) {
> +		rv = -ENOMEM;
> +		goto out_free_str;
> +	}
> +
> +	for (src = str; (slash = strchr(src, '/')); src = slash + 1) {
> +		ll->len = slash - src;
> +		memcpy(ll->str, src, ll->len);
> +		ll = lookup_list_next(ll);
> +	}
> +	ll->len = 0;
> +
> +	rv = 0;
> +
> +out_free_str:
> +	if (str != str0) {
> +		kfree(str);
> +	}
> +out:
> +	return rv;
> +}
> +
>  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
> @@ -137,6 +200,11 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  			return -EINVAL;
>  		break;
>  
> +	case Opt_lookup:
> +		if (proc_parse_lookup_param(fc, param->string) < 0)
> +			return -EINVAL;
> +		break;
> +
>  	default:
>  		return -EINVAL;
>  	}
> @@ -157,6 +225,10 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
>  		fs_info->hide_pid = ctx->hidepid;
>  	if (ctx->mask & (1 << Opt_subset))
>  		fs_info->pidonly = ctx->pidonly;
> +	if (ctx->mask & (1 << Opt_lookup)) {
> +		fs_info->lookup_list = ctx->lookup_list;
> +		ctx->lookup_list = NULL;
> +	}
>  }
>  
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> @@ -218,6 +290,14 @@ static int proc_reconfigure(struct fs_context *fc)
>  	struct super_block *sb = fc->root->d_sb;
>  	struct proc_fs_info *fs_info = proc_sb_info(sb);
>  
> +	/*
> +	 * "Hide everything" lookup filter is not a problem as only
> +	 * /proc/${pid}, /proc/self and /proc/thread-self are accessible.
> +	 */
> +	if (fs_info->lookup_list && fs_info->lookup_list->len > 0) {
> +		return invalfc(fc, "'-o remount,lookup=' is unsupported, unmount and mount instead");
> +	}
> +
>  	sync_filesystem(sb);
>  
>  	proc_apply_options(fs_info, fc, current_user_ns());
> @@ -234,11 +314,34 @@ static void proc_fs_context_free(struct fs_context *fc)
>  	struct proc_fs_context *ctx = fc->fs_private;
>  
>  	put_pid_ns(ctx->pid_ns);
> +	kfree(ctx->lookup_list);
>  	kfree(ctx);
>  }
>  
> +static int proc_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
> +{
> +	struct proc_fs_context *src = fc->fs_private;
> +	struct proc_fs_context *dst;
> +
> +	dst = kmemdup(src, sizeof(struct proc_fs_context), GFP_KERNEL);
> +	if (!dst) {
> +		return -ENOMEM;
> +	}
> +
> +	dst->lookup_list = kmemdup(dst->lookup_list, dst->lookup_list_len, GFP_KERNEL);
> +	if (!dst->lookup_list) {
> +		kfree(dst);
> +		return -ENOMEM;
> +	}
> +	get_pid_ns(dst->pid_ns);
> +
> +	fc->fs_private = dst;
> +	return 0;
> +}
> +
>  static const struct fs_context_operations proc_fs_context_ops = {
>  	.free		= proc_fs_context_free,
> +	.dup		= proc_fs_context_dup,
>  	.parse_param	= proc_parse_param,
>  	.get_tree	= proc_get_tree,
>  	.reconfigure	= proc_reconfigure,
> @@ -274,6 +377,7 @@ static void proc_kill_sb(struct super_block *sb)
>  
>  	kill_anon_super(sb);
>  	put_pid_ns(fs_info->pid_ns);
> +	kfree(fs_info->lookup_list);
>  	kfree(fs_info);
>  }
>  
> @@ -317,12 +421,33 @@ static int proc_root_getattr(struct user_namespace *mnt_userns,
>  	return 0;
>  }
>  
> +bool in_lookup_list(const struct proc_lookup_list *ll, const char *str, unsigned int len)
> +{
> +	if (ll) {
> +		for (; ll->len > 0; ll = lookup_list_next(ll)) {
> +			if (ll->len == len && strncmp(ll->str, str, len) == 0) {
> +				return true;
> +			}
> +		}
> +		return false;
> +	} else {
> +		return true;
> +	}
> +}
> +
>  static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
>  {
> +	struct proc_fs_info *proc_sb = proc_sb_info(dir->i_sb);
> +
>  	if (!proc_pid_lookup(dentry, flags))
>  		return NULL;
>  
> -	return proc_lookup(dir, dentry, flags);
> +	if (in_lookup_list(proc_sb->lookup_list, dentry->d_name.name, dentry->d_name.len)) {
> +		return proc_lookup(dir, dentry, flags);
> +	} else {
> +		return NULL;
> +	}
> +
>  }
>  
>  static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 069c7fd95396..d2c067560bf9 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -10,6 +10,7 @@
>  #include <linux/fs.h>
>  
>  struct proc_dir_entry;
> +struct proc_lookup_list;
>  struct seq_file;
>  struct seq_operations;
>  
> @@ -65,6 +66,7 @@ struct proc_fs_info {
>  	kgid_t pid_gid;
>  	enum proc_hidepid hide_pid;
>  	enum proc_pidonly pidonly;
> +	const struct proc_lookup_list *lookup_list;
>  };
>  
>  static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
> -- 
> 2.31.1
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 16:24 ` Christian Brauner
@ 2022-01-19 17:15   ` Alexey Gladkov
  2022-01-19 17:31     ` Christian Brauner
  2022-01-20 12:23   ` Alexey Dobriyan
  2022-01-20 12:32   ` Alexey Dobriyan
  2 siblings, 1 reply; 11+ messages in thread
From: Alexey Gladkov @ 2022-01-19 17:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexey Dobriyan, viro, ebiederm, akpm, linux-kernel,
	linux-fsdevel, stephen.s.brennan, cyphar

On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > From: Alexey Dobriyan <adobriyan@gmail.com>
> > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > 
> > Docker implements MaskedPaths configuration option
> > 
> > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > 
> > to disable certain /proc files. It overmounts them with /dev/null.
> > 
> > Implement proper mount option which selectively disables lookup/readdir
> > in the top level /proc directory so that MaskedPaths doesn't need
> > to be updated as time goes on.
> 
> I might've missed this when this was sent the last time so maybe it was
> clearly explained in an earlier thread: What's the reason this needs to
> live in the kernel?
> 
> The MaskedPaths entry is optional so runtimes aren't required to block
> anything by default and this mostly makes sense for workloads that run
> privileged.
> 
> In addition MaskedPaths is a generic option which allows to hide any
> existing path, not just proc. Even in the very docker-specific defaults
> /sys/firmware is covered.
> 
> I do see clear value in the subset= and hidepid= options. They are
> generally useful independent of opinionated container workloads. I don't
> see the same for lookup=.
> 
> An alternative I find more sensible is to add a new value for subset=
> that hides anything(?) that only global root should have read/write
> access too.

Or we can allow to change permissions in the procfs only in the direction
of decreasing (if some file has 644 then allow to set 640 or 600). In this
case, we will not need to constantly check the whitelist.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 17:15   ` Alexey Gladkov
@ 2022-01-19 17:31     ` Christian Brauner
  2022-01-19 18:30       ` Alexey Gladkov
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2022-01-19 17:31 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Alexey Dobriyan, viro, ebiederm, akpm, linux-kernel,
	linux-fsdevel, stephen.s.brennan, cyphar

On Wed, Jan 19, 2022 at 06:15:22PM +0100, Alexey Gladkov wrote:
> On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> > On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > > From: Alexey Dobriyan <adobriyan@gmail.com>
> > > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > > 
> > > Docker implements MaskedPaths configuration option
> > > 
> > > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > > 
> > > to disable certain /proc files. It overmounts them with /dev/null.
> > > 
> > > Implement proper mount option which selectively disables lookup/readdir
> > > in the top level /proc directory so that MaskedPaths doesn't need
> > > to be updated as time goes on.
> > 
> > I might've missed this when this was sent the last time so maybe it was
> > clearly explained in an earlier thread: What's the reason this needs to
> > live in the kernel?
> > 
> > The MaskedPaths entry is optional so runtimes aren't required to block
> > anything by default and this mostly makes sense for workloads that run
> > privileged.
> > 
> > In addition MaskedPaths is a generic option which allows to hide any
> > existing path, not just proc. Even in the very docker-specific defaults
> > /sys/firmware is covered.
> > 
> > I do see clear value in the subset= and hidepid= options. They are
> > generally useful independent of opinionated container workloads. I don't
> > see the same for lookup=.
> > 
> > An alternative I find more sensible is to add a new value for subset=
> > that hides anything(?) that only global root should have read/write
> > access too.
> 
> Or we can allow to change permissions in the procfs only in the direction
> of decreasing (if some file has 644 then allow to set 640 or 600). In this
> case, we will not need to constantly check the whitelist.

I don't fancy any filtering or allowlist approach. I find that rather
inelegant. But if I understand you correctly is that if we were to have
decreasing permissions we could allow a (namespace) procfs-admin to set
permissions so that the relevant files are essentially read-only or not
even readable at all for container workloads. So once you've lowered
perms you can't raise them which ensures even namespace procfs-admin
can't raise them again.
Might work as well. But that implies that we wouldn't need any allowlist
at all afaict.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 17:31     ` Christian Brauner
@ 2022-01-19 18:30       ` Alexey Gladkov
  0 siblings, 0 replies; 11+ messages in thread
From: Alexey Gladkov @ 2022-01-19 18:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexey Dobriyan, viro, ebiederm, akpm, linux-kernel,
	linux-fsdevel, stephen.s.brennan, cyphar

On Wed, Jan 19, 2022 at 06:31:07PM +0100, Christian Brauner wrote:
> On Wed, Jan 19, 2022 at 06:15:22PM +0100, Alexey Gladkov wrote:
> > On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> > > On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > > > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > > > From: Alexey Dobriyan <adobriyan@gmail.com>
> > > > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > > > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > > > 
> > > > Docker implements MaskedPaths configuration option
> > > > 
> > > > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > > > 
> > > > to disable certain /proc files. It overmounts them with /dev/null.
> > > > 
> > > > Implement proper mount option which selectively disables lookup/readdir
> > > > in the top level /proc directory so that MaskedPaths doesn't need
> > > > to be updated as time goes on.
> > > 
> > > I might've missed this when this was sent the last time so maybe it was
> > > clearly explained in an earlier thread: What's the reason this needs to
> > > live in the kernel?
> > > 
> > > The MaskedPaths entry is optional so runtimes aren't required to block
> > > anything by default and this mostly makes sense for workloads that run
> > > privileged.
> > > 
> > > In addition MaskedPaths is a generic option which allows to hide any
> > > existing path, not just proc. Even in the very docker-specific defaults
> > > /sys/firmware is covered.
> > > 
> > > I do see clear value in the subset= and hidepid= options. They are
> > > generally useful independent of opinionated container workloads. I don't
> > > see the same for lookup=.
> > > 
> > > An alternative I find more sensible is to add a new value for subset=
> > > that hides anything(?) that only global root should have read/write
> > > access too.
> > 
> > Or we can allow to change permissions in the procfs only in the direction
> > of decreasing (if some file has 644 then allow to set 640 or 600). In this
> > case, we will not need to constantly check the whitelist.
> 
> I don't fancy any filtering or allowlist approach. I find that rather
> inelegant.

Yep. I also don't find it very convenient if you need to allow more than
one or two files. That's why I didn't do anything like that when I
implemented subset=.

> But if I understand you correctly is that if we were to have
> decreasing permissions we could allow a (namespace) procfs-admin to set
> permissions so that the relevant files are essentially read-only or not
> even readable at all for container workloads. So once you've lowered
> perms you can't raise them which ensures even namespace procfs-admin
> can't raise them again.

Yes. This is what I meant.

> Might work as well. But that implies that we wouldn't need any allowlist
> at all afaict.

Yes, in this case we don't need a list.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 16:24 ` Christian Brauner
  2022-01-19 17:15   ` Alexey Gladkov
@ 2022-01-20 12:23   ` Alexey Dobriyan
  2022-01-20 14:37     ` Christian Brauner
  2022-01-20 12:32   ` Alexey Dobriyan
  2 siblings, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2022-01-20 12:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel,
	stephen.s.brennan, legion, cyphar

On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > From: Alexey Dobriyan <adobriyan@gmail.com>
> > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > 
> > Docker implements MaskedPaths configuration option
> > 
> > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > 
> > to disable certain /proc files. It overmounts them with /dev/null.
> > 
> > Implement proper mount option which selectively disables lookup/readdir
> > in the top level /proc directory so that MaskedPaths doesn't need
> > to be updated as time goes on.
> 
> I might've missed this when this was sent the last time so maybe it was
> clearly explained in an earlier thread: What's the reason this needs to
> live in the kernel?
> 
> The MaskedPaths entry is optional so runtimes aren't required to block
> anything by default and this mostly makes sense for workloads that run
> privileged.
> 
> In addition MaskedPaths is a generic option which allows to hide any
> existing path, not just proc. Even in the very docker-specific defaults
> /sys/firmware is covered.

MaskedPaths is not future proof, new entries might pop up and nobody
will update the MaskedPaths list.

> I do see clear value in the subset= and hidepid= options. They are
> generally useful independent of opinionated container workloads. I don't
> see the same for lookup=.

The value is if you get /proc/cpuinfo you get everything else
but you might not want everything else given that "everything else"
changes over time.

> An alternative I find more sensible is to add a new value for subset=
> that hides anything(?) that only global root should have read/write
> access too.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 17:04 ` Alexey Gladkov
@ 2022-01-20 12:26   ` Alexey Dobriyan
  0 siblings, 0 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-01-20 12:26 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel, stephen.s.brennan

On Wed, Jan 19, 2022 at 06:04:32PM +0100, Alexey Gladkov wrote:
> On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > >From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > From: Alexey Dobriyan <adobriyan@gmail.com>
> > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > 
> > Docker implements MaskedPaths configuration option
> > 
> > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > 
> > to disable certain /proc files. It overmounts them with /dev/null.
> > 
> > Implement proper mount option which selectively disables lookup/readdir
> > in the top level /proc directory so that MaskedPaths doesn't need
> > to be updated as time goes on.
> > 
> > Syntax is
> > 
> > 			Filter everything
> > 	# mount -t proc -o lookup=/ proc /proc
> > 	# ls /proc
> > 	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
> > 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
> > 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163
> > 
> > 			Allow /proc/cpuinfo and /proc/uptime
> > 	# mount -t proc proc -o lookup=cpuinfo/uptime /proc
> > 
> > 	# ls /proc
> > 				...
> > 	dr-xr-xr-x   8 root       root          0 Nov 22 21:12 995
> > 	-r--r--r--   1 root       root          0 Nov 22 21:12 cpuinfo
> > 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 self -> 1163
> > 	lrwxrwxrwx   1 root       root          0 Nov 22 21:12 thread-self -> 1163/task/1163
> > 	-r--r--r--   1 root       root          0 Nov 22 21:12 uptime
> > 
> > Trailing slash is optional but saves 1 allocation.
> > Trailing slash is mandatory for "filter everything".
> > 
> > Remounting with lookup= is disabled so that files and dcache entries
> > don't stay active while filter list is changed. Users are supposed
> > to unmount and mount again with different lookup= set.
> > Remount rules may change in the future. (Eric W. Biederman)
> > 
> > Re: speed
> > This is the price for filtering, given that lookup= is whitelist it is
> > not supposed to be very long. Second, it is one linear memory scan per
> > lookup, there are no linked lists. It may be faster than rbtree in fact.
> > It consumes 1 allocation per superblock which is list of names itself.
> > 
> > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> > ---
> > 
> > 	v2
> > 	documentation!
> > 	descriptive comments!
> > 	disable remount
> > 
> >  Documentation/filesystems/proc.rst |   8 ++
> >  fs/proc/generic.c                  |  18 ++--
> >  fs/proc/internal.h                 |  31 ++++++-
> >  fs/proc/proc_net.c                 |   2 +-
> >  fs/proc/root.c                     | 127 ++++++++++++++++++++++++++++-
> >  include/linux/proc_fs.h            |   2 +
> >  6 files changed, 178 insertions(+), 10 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 8d7f141c6fc7..9a328f0b4346 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2186,6 +2186,7 @@ The following mount options are supported:
> >  	hidepid=	Set /proc/<pid>/ access mode.
> >  	gid=		Set the group authorized to learn processes information.
> >  	subset=		Show only the specified subset of procfs.
> > +        lookup=         Top-level /proc filter, independent of subset=
> 
> Will it be possible to combine lookup= and subset= options when mounting?

Currently only subset=pid is implemented, which is equivalent to

	mount -t proc -o lookup=/ proc /proc

In the future subset= might expand and lookup= could filter whatever
exposed.

> > +lookup= mount option makes available only listed files/directories in
> > +the top-level /proc directory. Individual names are separated
> > +by slash. Empty list is equivalent to subset=pid. lookup= filters before
> > +subset= if both options are supplied. lookup= doesn't affect /proc/${pid}
> > +directories availability as well as /proc/self and /proc/thread-self
> > +symlinks. More fine-grained filtering is not supported at the moment.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-19 16:24 ` Christian Brauner
  2022-01-19 17:15   ` Alexey Gladkov
  2022-01-20 12:23   ` Alexey Dobriyan
@ 2022-01-20 12:32   ` Alexey Dobriyan
  2 siblings, 0 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-01-20 12:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel,
	stephen.s.brennan, legion, cyphar

On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > From: Alexey Dobriyan <adobriyan@gmail.com>
> > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > 
> > Docker implements MaskedPaths configuration option
> > 
> > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > 
> > to disable certain /proc files. It overmounts them with /dev/null.
> > 
> > Implement proper mount option which selectively disables lookup/readdir
> > in the top level /proc directory so that MaskedPaths doesn't need
> > to be updated as time goes on.
> 
> I might've missed this when this was sent the last time so maybe it was
> clearly explained in an earlier thread: What's the reason this needs to
> live in the kernel?

The reasons are:
	MaskedPaths or equivalents are blacklists, not future proof

	MaskedPaths is applied at container creation once,
	lookup= is applied at mount time surely but names aren't
	required to exist to be filtered (read: some silly ISV module
	gets loaded, creates /proc entries, containers get them with all
	security holes)

> The MaskedPaths entry is optional so runtimes aren't required to block
> anything by default and this mostly makes sense for workloads that run
> privileged.
> 
> In addition MaskedPaths is a generic option which allows to hide any
> existing path, not just proc. Even in the very docker-specific defaults
> /sys/firmware is covered.

Sure, the patch is for /proc only. MaskedPaths can't overmount with
/dev/null file which doesn't exist yet.

> I do see clear value in the subset= and hidepid= options. They are
> generally useful independent of opinionated container workloads. I don't
> see the same for lookup=.
> 
> An alternative I find more sensible is to add a new value for subset=
> that hides anything(?) that only global root should have read/write
> access too.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] proc: "mount -o lookup=" support
  2022-01-20 12:23   ` Alexey Dobriyan
@ 2022-01-20 14:37     ` Christian Brauner
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2022-01-20 14:37 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: viro, ebiederm, akpm, linux-kernel, linux-fsdevel,
	stephen.s.brennan, legion, cyphar

On Thu, Jan 20, 2022 at 03:32:29PM +0300, Alexey Dobriyan wrote:
> On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> > On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > > From: Alexey Dobriyan <adobriyan@gmail.com>
> > > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > > 
> > > Docker implements MaskedPaths configuration option
> > > 
> > > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > > 
> > > to disable certain /proc files. It overmounts them with /dev/null.
> > > 
> > > Implement proper mount option which selectively disables lookup/readdir
> > > in the top level /proc directory so that MaskedPaths doesn't need
> > > to be updated as time goes on.
> > 
> > I might've missed this when this was sent the last time so maybe it was
> > clearly explained in an earlier thread: What's the reason this needs to
> > live in the kernel?
> 
> The reasons are:
> 	MaskedPaths or equivalents are blacklists, not future proof
> 
> 	MaskedPaths is applied at container creation once,
> 	lookup= is applied at mount time surely but names aren't
> 	required to exist to be filtered (read: some silly ISV module
> 	gets loaded, creates /proc entries, containers get them with all
> 	security holes)
> 
> > The MaskedPaths entry is optional so runtimes aren't required to block
> > anything by default and this mostly makes sense for workloads that run
> > privileged.
> > 
> > In addition MaskedPaths is a generic option which allows to hide any
> > existing path, not just proc. Even in the very docker-specific defaults
> > /sys/firmware is covered.
> 
> Sure, the patch is for /proc only. MaskedPaths can't overmount with
> /dev/null file which doesn't exist yet.
> 
> > I do see clear value in the subset= and hidepid= options. They are
> > generally useful independent of opinionated container workloads. I don't
> > see the same for lookup=.
> > 
> > An alternative I find more sensible is to add a new value for subset=
> > that hides anything(?) that only global root should have read/write
> > access too.

On Thu, Jan 20, 2022 at 03:23:04PM +0300, Alexey Dobriyan wrote:
> On Wed, Jan 19, 2022 at 05:24:23PM +0100, Christian Brauner wrote:
> > On Wed, Jan 19, 2022 at 06:48:03PM +0300, Alexey Dobriyan wrote:
> > > From 61376c85daab50afb343ce50b5a97e562bc1c8d3 Mon Sep 17 00:00:00 2001
> > > From: Alexey Dobriyan <adobriyan@gmail.com>
> > > Date: Mon, 22 Nov 2021 20:41:06 +0300
> > > Subject: [PATCH 1/1] proc: "mount -o lookup=..." support
> > > 
> > > Docker implements MaskedPaths configuration option
> > > 
> > > 	https://github.com/estesp/docker/blob/9c15e82f19b0ad3c5fe8617a8ec2dddc6639f40a/oci/defaults.go#L97
> > > 
> > > to disable certain /proc files. It overmounts them with /dev/null.
> > > 
> > > Implement proper mount option which selectively disables lookup/readdir
> > > in the top level /proc directory so that MaskedPaths doesn't need
> > > to be updated as time goes on.
> > 
> > I might've missed this when this was sent the last time so maybe it was
> > clearly explained in an earlier thread: What's the reason this needs to
> > live in the kernel?
> > 
> > The MaskedPaths entry is optional so runtimes aren't required to block
> > anything by default and this mostly makes sense for workloads that run
> > privileged.
> > 
> > In addition MaskedPaths is a generic option which allows to hide any
> > existing path, not just proc. Even in the very docker-specific defaults
> > /sys/firmware is covered.
> 
> MaskedPaths is not future proof, new entries might pop up and nobody
> will update the MaskedPaths list.
> 
> > I do see clear value in the subset= and hidepid= options. They are
> > generally useful independent of opinionated container workloads. I don't
> > see the same for lookup=.
> 
> The value is if you get /proc/cpuinfo you get everything else
> but you might not want everything else given that "everything else"
> changes over time.
> 
> > An alternative I find more sensible is to add a new value for subset=
> > that hides anything(?) that only global root should have read/write
> > access too.

Thanks for providing some more details.

If we really introduce new proc files in the future that are unsafe for
unprivileged containers then that's a whole bigger problem.

We shouldn't taper over this with a procfs mount option however.
Especially, since it's very likely that such new procfs files that would
be exploitable in unprivileged containers would also be exploitable by
regular users. The argument can't be that in order to protect against
buggy or information leaking future proc files we need to give proc a
special mount option for containers to restrict access to files and
directories.

And for the legacy files that existed before containers were a big thing
MaskedPath in userspace has worked fine with the last changes to update
the list from 2018 for the addition of a rather old directory.

And the same problem exists for sysfs. That's why /sys/firmware is in
there. (In fact, it can be argued that they should restrict sysfs way
more via MaskedPaths than procfs for privileged containers since it
leaks way more system-wide information and provides a way bigger attack
surface which is presumable why the mount is ro but then strangely only
hide /sys/firmware. Anyway, besides the point.)

MaskedPath is mostly a protection mechanism useful for privileged
containers as an unprivileged container can't modify anything that would
allow it to attack the system.

Ultimately, I think the current proposal here is too much modeled after
how a specific tool runs specific workloads and for containers and I
don't think that's a good idea.

We should do a generally useful thing that doesn't require any dynamic
filtering and userspace giving us files that are ok to show.

Alternative proposals to appear later in the thread. I'd be ok to
endorse one of those if you were to implement one of them. But for now
I'm not firmly convinced of lookup=.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-20 14:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-19 15:48 [PATCH v2] proc: "mount -o lookup=" support Alexey Dobriyan
2022-01-19 15:57 ` Alexey Dobriyan
2022-01-19 16:24 ` Christian Brauner
2022-01-19 17:15   ` Alexey Gladkov
2022-01-19 17:31     ` Christian Brauner
2022-01-19 18:30       ` Alexey Gladkov
2022-01-20 12:23   ` Alexey Dobriyan
2022-01-20 14:37     ` Christian Brauner
2022-01-20 12:32   ` Alexey Dobriyan
2022-01-19 17:04 ` Alexey Gladkov
2022-01-20 12:26   ` Alexey Dobriyan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).