[PATCH 0/2] proc: use subset option to hide some top-level procfs entries

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
@ 2020-06-04 20:04 Alexey Gladkov
  2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-04 20:04 UTC (permalink / raw)
  To: LKML
  Cc: Linux FS Devel, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andy Lutomirski, Eric W . Biederman, Kees Cook

Greetings!

Preface
-------
This patch set can be applied over:

git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788

Overview
--------
Directories and files can be created and deleted by dynamically loaded modules.
Not all of these files are virtualized and safe inside the container.

However, subset=pid is not enough because many containers wants to have
/proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
files per procfs mountpoint.

Introduced changes
------------------
Allow to specify the names of files and directories in the subset= parameter and
thereby make a whitelist of top-level permitted names.


Alexey Gladkov (2):
  proc: use subset option to hide some top-level procfs entries
  docs: proc: update documentation about subset= parameter

 Documentation/filesystems/proc.rst |  6 +++
 fs/proc/base.c                     | 15 +++++-
 fs/proc/generic.c                  | 75 +++++++++++++++++++++------
 fs/proc/inode.c                    | 18 ++++---
 fs/proc/internal.h                 | 12 +++++
 fs/proc/root.c                     | 81 ++++++++++++++++++++++++------
 include/linux/proc_fs.h            | 11 ++--
 7 files changed, 175 insertions(+), 43 deletions(-)

-- 
2.25.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 20:04 [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Alexey Gladkov
@ 2020-06-04 20:04 ` Alexey Gladkov
  2020-06-05  2:19   ` kernel test robot
  2020-06-05  2:19   ` [PATCH] proc: fix boolreturn.cocci warnings kernel test robot
  2020-06-04 20:04 ` [PATCH 2/2] docs: proc: update documentation about subset= parameter Alexey Gladkov
  2020-06-04 20:33 ` [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Eric W. Biederman
  2 siblings, 2 replies; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-04 20:04 UTC (permalink / raw)
  To: LKML
  Cc: Linux FS Devel, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andy Lutomirski, Eric W . Biederman, Kees Cook

In addition to subset=pid, added the ability to specify top-level
directory and file names.

Not all directories in procfs have proc directory entries. For example,
/proc/sys has its own directory operations and does not have proc
directory entries. But all paths in procfs have at least one top-level
proc directory entry.

Since files or directories can be created by kernel modules as they are
loaded, we use a list of names to check a proc directory entries.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c          | 15 +++++++-
 fs/proc/generic.c       | 75 ++++++++++++++++++++++++++++++--------
 fs/proc/inode.c         | 18 ++++++---
 fs/proc/internal.h      | 12 ++++++
 fs/proc/root.c          | 81 +++++++++++++++++++++++++++++++++--------
 include/linux/proc_fs.h | 11 +++---
 6 files changed, 169 insertions(+), 43 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b1d94d14ed5a..9851a1f80e8c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3329,6 +3329,10 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 		goto out;
 
 	fs_info = proc_sb_info(dentry->d_sb);
+
+	if (!is_pids_visible(fs_info))
+		goto out;
+
 	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
@@ -3388,13 +3392,18 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
-	struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
-	struct pid_namespace *ns = proc_pid_ns(file_inode(file)->i_sb);
+	struct proc_fs_info *fs_info;
+	struct pid_namespace *ns;
 	loff_t pos = ctx->pos;
 
 	if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
 		return 0;
 
+	fs_info = proc_sb_info(file_inode(file)->i_sb);
+
+	if (!is_pids_visible(fs_info))
+		goto out;
+
 	if (pos == TGID_OFFSET - 2) {
 		struct inode *inode = d_inode(fs_info->proc_self);
 		if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
@@ -3407,6 +3416,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 			return 0;
 		ctx->pos = pos = pos + 1;
 	}
+	ns = proc_pid_ns(file_inode(file)->i_sb);
 	iter.tgid = pos - TGID_OFFSET;
 	iter.task = NULL;
 	for (iter = next_tgid(ns, iter);
@@ -3427,6 +3437,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 			return 0;
 		}
 	}
+out:
 	ctx->pos = PID_MAX_LIMIT + TGID_OFFSET;
 	return 0;
 }
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 2f9fa179194d..e96a593d3b09 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -196,6 +196,48 @@ static int xlate_proc_name(const char *name, struct proc_dir_entry **ret,
 	return rv;
 }
 
+bool is_pde_visible(struct proc_fs_info *fs_info, struct proc_dir_entry *pde)
+{
+	int i;
+
+	if (!fs_info->whitelist || pde == &proc_root)
+		return 1;
+
+	read_lock(&proc_subdir_lock);
+
+	for (i = 0; fs_info->whitelist[i]; i++) {
+		struct proc_dir_entry *ent, *de;
+
+		if (!strcmp(fs_info->whitelist[i], "pid"))
+			continue;
+
+		ent = pde_subdir_find(&proc_root, fs_info->whitelist[i],
+				strlen(fs_info->whitelist[i]));
+		if (!ent)
+			continue;
+
+		if (ent == pde) {
+			read_unlock(&proc_subdir_lock);
+			return 1;
+		}
+
+		if (!S_ISDIR(ent->mode))
+			continue;
+
+		de = pde->parent;
+		while (de != &proc_root) {
+			if (ent == de) {
+				read_unlock(&proc_subdir_lock);
+				return 1;
+			}
+			de = de->parent;
+		}
+	}
+
+	read_unlock(&proc_subdir_lock);
+	return 0;
+}
+
 static DEFINE_IDA(proc_inum_ida);
 
 #define PROC_DYNAMIC_FIRST 0xF0000000U
@@ -251,6 +293,9 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
 {
 	struct inode *inode;
 
+	if (!is_visible(dir))
+		return ERR_PTR(-ENOENT);
+
 	read_lock(&proc_subdir_lock);
 	de = pde_subdir_find(de, dentry->d_name.name, dentry->d_name.len);
 	if (de) {
@@ -259,6 +304,8 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
 		inode = proc_get_inode(dir->i_sb, de);
 		if (!inode)
 			return ERR_PTR(-ENOMEM);
+		if (!is_visible(inode))
+			return ERR_PTR(-ENOENT);
 		d_set_d_op(dentry, de->proc_dops);
 		return d_splice_alias(inode, dentry);
 	}
@@ -269,11 +316,6 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
 struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
 		unsigned int flags)
 {
-	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
-
-	if (fs_info->pidonly == PROC_PIDONLY_ON)
-		return ERR_PTR(-ENOENT);
-
 	return proc_lookup_de(dir, dentry, PDE(dir));
 }
 
@@ -289,11 +331,17 @@ struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
 int proc_readdir_de(struct file *file, struct dir_context *ctx,
 		    struct proc_dir_entry *de)
 {
+	struct proc_fs_info *fs_info;
 	int i;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
 
+	fs_info = proc_sb_info(file_inode(file)->i_sb);
+
+	if (!is_pde_visible(fs_info, de))
+		return 0;
+
 	i = ctx->pos - 2;
 	read_lock(&proc_subdir_lock);
 	de = pde_subdir_first(de);
@@ -312,12 +360,14 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
 		struct proc_dir_entry *next;
 		pde_get(de);
 		read_unlock(&proc_subdir_lock);
-		if (!dir_emit(ctx, de->name, de->namelen,
-			    de->low_ino, de->mode >> 12)) {
-			pde_put(de);
-			return 0;
+		if (is_pde_visible(fs_info, de)) {
+			if (!dir_emit(ctx, de->name, de->namelen,
+				    de->low_ino, de->mode >> 12)) {
+				pde_put(de);
+				return 0;
+			}
+			ctx->pos++;
 		}
-		ctx->pos++;
 		read_lock(&proc_subdir_lock);
 		next = pde_subdir_next(de);
 		pde_put(de);
@@ -330,11 +380,6 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
-	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
-
-	if (fs_info->pidonly == PROC_PIDONLY_ON)
-		return 1;
-
 	return proc_readdir_de(file, ctx, PDE(inode));
 }
 
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index f40c2532c057..61374eb76ce4 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -181,13 +181,20 @@ static inline const char *hidepid2str(enum proc_hidepid v)
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
+	int i;
 
 	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
 	if (fs_info->hide_pid != HIDEPID_OFF)
 		seq_printf(seq, ",hidepid=%s", hidepid2str(fs_info->hide_pid));
-	if (fs_info->pidonly != PROC_PIDONLY_OFF)
-		seq_printf(seq, ",subset=pid");
+	if (fs_info->whitelist) {
+		seq_puts(seq, ",subset=");
+		for (i = 0; fs_info->whitelist[i]; i++) {
+			if (i)
+				seq_puts(seq, ":");
+			seq_printf(seq, "%s", fs_info->whitelist[i]);
+		}
+	}
 
 	return 0;
 }
@@ -478,13 +485,15 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
 
 static int proc_reg_open(struct inode *inode, struct file *file)
 {
-	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct proc_dir_entry *pde = PDE(inode);
 	int rv = 0;
 	typeof_member(struct proc_ops, proc_open) open;
 	typeof_member(struct proc_ops, proc_release) release;
 	struct pde_opener *pdeo;
 
+	if (!is_visible(inode))
+		return -ENOENT;
+
 	if (pde_is_permanent(pde)) {
 		open = pde->proc_ops->proc_open;
 		if (open)
@@ -492,9 +501,6 @@ static int proc_reg_open(struct inode *inode, struct file *file)
 		return rv;
 	}
 
-	if (fs_info->pidonly == PROC_PIDONLY_ON)
-		return -ENOENT;
-
 	/*
 	 * Ensure that
 	 * 1) PDE's ->release hook will be called no matter what
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 917cc85e3466..2f14a3692fc1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -203,6 +203,18 @@ static inline bool is_empty_pde(const struct proc_dir_entry *pde)
 }
 extern ssize_t proc_simple_write(struct file *, const char __user *, size_t, loff_t *);
 
+extern bool is_pde_visible(struct proc_fs_info *fs_info, struct proc_dir_entry *pde);
+
+static inline bool is_pids_visible(struct proc_fs_info *fs_info)
+{
+	return (fs_info->pids_visibility == PROC_PIDS_VISIBLE);
+}
+
+static inline bool is_visible(struct inode *inode)
+{
+	return is_pde_visible(proc_sb_info(inode->i_sb), PDE(inode));
+}
+
 /*
  * inode.c
  */
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ffebed1999e5..5401d88abe9d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -34,7 +34,8 @@ struct proc_fs_context {
 	unsigned int		mask;
 	enum proc_hidepid	hidepid;
 	int			gid;
-	enum proc_pidonly	pidonly;
+	enum proc_pids_visible	pids_visibility;
+	char			**whitelist;
 };
 
 enum proc_param {
@@ -89,26 +90,71 @@ static int proc_parse_hidepid_param(struct fs_context *fc, struct fs_parameter *
 	return 0;
 }
 
-static int proc_parse_subset_param(struct fs_context *fc, char *value)
+static int proc_parse_subset_param(struct fs_context *fc, const char *value)
 {
+	int i, elemsiz, siz;
+	const char *elem, *endelem, *delim, *endval;
+	char *ptr;
 	struct proc_fs_context *ctx = fc->fs_private;
 
-	while (value) {
-		char *ptr = strchr(value, ',');
+	if (strpbrk(value, "/ "))
+		return invalf(fc, "proc: unsupported subset value - `%s'", value);
 
-		if (ptr != NULL)
-			*ptr++ = '\0';
+	endval = value + strlen(value) + 1;
+	siz = i = 0;
 
-		if (*value != '\0') {
-			if (!strcmp(value, "pid")) {
-				ctx->pidonly = PROC_PIDONLY_ON;
-			} else {
-				return invalf(fc, "proc: unsupported subset option - %s\n", value);
-			}
+	for (delim = elem = value; delim; elem = endelem) {
+		delim = strchr(elem, ':');
+
+		endelem = delim ? ++delim : endval;
+		elemsiz = endelem - elem;
+
+		if (elemsiz > 1) {
+			siz += sizeof(char *) + elemsiz;
+			i++;
 		}
-		value = ptr;
 	}
 
+	if (!i)
+		return invalf(fc, "proc: empty subset value is not valid");
+
+	siz += sizeof(char *);
+	i++;
+
+	kfree(ctx->whitelist);
+	ctx->whitelist = kmalloc(siz, GFP_KERNEL);
+
+	if (!ctx->whitelist) {
+		errorf(fc, "proc: unable to allocate enough memory to store subset filter");
+		return -ENOMEM;
+	}
+
+	ctx->pids_visibility = PROC_PIDS_INVISIBLE;
+
+	ptr = (char *)(ctx->whitelist + i);
+	siz = i = 0;
+
+	for (delim = elem = value; delim; elem = endelem) {
+		delim = strchr(elem, ':');
+
+		endelem = delim ? ++delim : endval;
+		elemsiz = endelem - elem;
+
+		if (elemsiz <= 1)
+			continue;
+
+		if (!strncmp("pid", elem, elemsiz - 1))
+			ctx->pids_visibility = PROC_PIDS_VISIBLE;
+
+		strncpy(ptr, elem, elemsiz);
+		ctx->whitelist[i] = ptr;
+		ctx->whitelist[i][elemsiz - 1] = '\0';
+
+		ptr += elemsiz;
+		i++;
+	}
+	ctx->whitelist[i] = NULL;
+
 	return 0;
 }
 
@@ -155,8 +201,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
-	if (ctx->mask & (1 << Opt_subset))
-		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_subset)) {
+		kfree(fs_info->whitelist);
+		fs_info->whitelist = ctx->whitelist;
+		fs_info->pids_visibility = ctx->pids_visibility;
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -270,6 +319,8 @@ static void proc_kill_sb(struct super_block *sb)
 	if (fs_info->proc_thread_self)
 		dput(fs_info->proc_thread_self);
 
+	kfree(fs_info->whitelist);
+
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
 	kfree(fs_info);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 6ec524d8842c..74d56ddb67b6 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -50,10 +50,10 @@ enum proc_hidepid {
 	HIDEPID_NOT_PTRACEABLE = 4, /* Limit pids to only ptraceable pids */
 };
 
-/* definitions for proc mount option pidonly */
-enum proc_pidonly {
-	PROC_PIDONLY_OFF = 0,
-	PROC_PIDONLY_ON  = 1,
+/* definitions for pid visibility */
+enum proc_pids_visible {
+	PROC_PIDS_VISIBLE   = 0,
+	PROC_PIDS_INVISIBLE = 1,
 };
 
 struct proc_fs_info {
@@ -62,7 +62,8 @@ struct proc_fs_info {
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
 	kgid_t pid_gid;
 	enum proc_hidepid hide_pid;
-	enum proc_pidonly pidonly;
+	enum proc_pids_visible pids_visibility;
+	char **whitelist;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] docs: proc: update documentation about subset= parameter
  2020-06-04 20:04 [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Alexey Gladkov
  2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
@ 2020-06-04 20:04 ` Alexey Gladkov
  2020-06-04 20:33 ` [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Eric W. Biederman
  2 siblings, 0 replies; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-04 20:04 UTC (permalink / raw)
  To: LKML
  Cc: Linux FS Devel, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andy Lutomirski, Eric W . Biederman, Kees Cook

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 Documentation/filesystems/proc.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e2ecf248feb5..68acd335cd8c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2177,6 +2177,12 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+subset=NAME[:NAME...] allows to give access to the listed file names or
+directories located at the top level of procfs. These files and directories
+do not have to be present at the time of mounting procfs. They may appear later
+when the kernel module that creates them loads. The list of names can be
+combined with subst=pid.
+
 5	Filesystem behavior
 ----------------------------
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 20:04 [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Alexey Gladkov
  2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
  2020-06-04 20:04 ` [PATCH 2/2] docs: proc: update documentation about subset= parameter Alexey Gladkov
@ 2020-06-04 20:33 ` Eric W. Biederman
  2020-06-04 21:32   ` Christian Brauner
  2020-06-05  0:08   ` Alexey Gladkov
  2 siblings, 2 replies; 11+ messages in thread
From: Eric W. Biederman @ 2020-06-04 20:33 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Alexey Gladkov, Andy Lutomirski, Kees Cook, Linux Containers

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> Greetings!
>
> Preface
> -------
> This patch set can be applied over:
>
> git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788

I am not going to seriously look at this for merging until after the
merge window closes. 

Have you thought about the possibility of relaxing the permission checks
to mount proc such that we don't need to verify there is an existing
mount of proc?  With just the subset pids I think this is feasible.  It
might not be worth it at this point, but it is definitely worth asking
the question.  As one of the benefits early propopents of the idea of a
subset of proc touted was that they would not be as restricted as they
are with today's proc.

I ask because this has a bearing on the other options you are playing
with.

Do we want to find a way to have the benefit of relaxed permission
checks while still including a few more files.

> Overview
> --------
> Directories and files can be created and deleted by dynamically loaded modules.
> Not all of these files are virtualized and safe inside the container.
>
> However, subset=pid is not enough because many containers wants to have
> /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> files per procfs mountpoint.

Is it desirable to have meminfo and cpuinfo as they are today or do
people want them to reflect the ``container'' context.   So that
applications like the JVM don't allocation too many cpus or don't try
and consume too much memory, or run on nodes that cgroups current make
unavailable.

Are there any users or planned users of this functionality yet?

I am concerned that you might be adding functionality that no one will
ever use that will just add code to the kernel that no one cares about,
that will then accumulate bugs.  Having had to work through a few of
those cases to make each mount of proc have it's own super block I am
not a great fan of adding another one.

If the runc, lxc and other container runtime folks can productively use
such and option to do useful things and they are sensible things to do I
don't have any fundamental objection.  But I do want to be certain this
is a feature that is going to be used.

Eric

> Introduced changes
> ------------------
> Allow to specify the names of files and directories in the subset= parameter and
> thereby make a whitelist of top-level permitted names.
>
>
> Alexey Gladkov (2):
>   proc: use subset option to hide some top-level procfs entries
>   docs: proc: update documentation about subset= parameter
>
>  Documentation/filesystems/proc.rst |  6 +++
>  fs/proc/base.c                     | 15 +++++-
>  fs/proc/generic.c                  | 75 +++++++++++++++++++++------
>  fs/proc/inode.c                    | 18 ++++---
>  fs/proc/internal.h                 | 12 +++++
>  fs/proc/root.c                     | 81 ++++++++++++++++++++++++------
>  include/linux/proc_fs.h            | 11 ++--
>  7 files changed, 175 insertions(+), 43 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 20:33 ` [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Eric W. Biederman
@ 2020-06-04 21:32   ` Christian Brauner
  2020-06-05  0:28     ` Alexey Gladkov
  2020-06-05  0:08   ` Alexey Gladkov
  1 sibling, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2020-06-04 21:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Gladkov, Kees Cook, Linux Containers, LKML,
	Alexander Viro, Andy Lutomirski, Linux FS Devel, Alexey Gladkov,
	Alexey Dobriyan, Stéphane Graber

On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
> 
> I am not going to seriously look at this for merging until after the
> merge window closes. 
> 
> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc?  With just the subset pids I think this is feasible.  It
> might not be worth it at this point, but it is definitely worth asking
> the question.  As one of the benefits early propopents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.
> 
> I ask because this has a bearing on the other options you are playing
> with.
> 
> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files.
> 
> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers wants to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
> 
> Is it desirable to have meminfo and cpuinfo as they are today or do
> people want them to reflect the ``container'' context.   So that
> applications like the JVM don't allocation too many cpus or don't try
> and consume too much memory, or run on nodes that cgroups current make
> unavailable.
> 
> Are there any users or planned users of this functionality yet?
> 
> I am concerned that you might be adding functionality that no one will
> ever use that will just add code to the kernel that no one cares about,
> that will then accumulate bugs.  Having had to work through a few of
> those cases to make each mount of proc have it's own super block I am
> not a great fan of adding another one.
> 
> If the runc, lxc and other container runtime folks can productively use
> such and option to do useful things and they are sensible things to do I
> don't have any fundamental objection.  But I do want to be certain this
> is a feature that is going to be used.

I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
I haven't had time to look at this patchset).
In any case, we are currently virtualizing:
/proc/cpuinfo
/proc/diskstats
/proc/loadavg
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
for each container with a tiny in-userspace filesystem LXCFS
( https://github.com/lxc/lxcfs )
and have been doing that for years.
Having meminfo and cpuinfo virtualized in procfs was something we have
been wanting for a long time and there have been patches by other people
(from Siteground, I believe) to achieve this a few years back but were
disregarded.

I think meminfo and cpuinfo would already be great. And if we're
virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
/proc/stat. It would also be great to virtualize /proc/uptime. Right now
we're achieving this essentially by substracting the time the init
process of the pid namespace has started since system boot time, minus
the time when the system started to get the actual reaper age (It's a
bit more involved but that's the gist.).

This is all on the topic list for this year's virtual container's
microconference at Plumber's and I would suggest we try to discuss the
various requirements for something like this there. (I'm about to send
the CFP out.)

Christian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 20:33 ` [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Eric W. Biederman
  2020-06-04 21:32   ` Christian Brauner
@ 2020-06-05  0:08   ` Alexey Gladkov
  2020-06-05  4:17     ` Eric W. Biederman
  1 sibling, 1 reply; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-05  0:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Andy Lutomirski, Kees Cook, Linux Containers

On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > Greetings!
> >
> > Preface
> > -------
> > This patch set can be applied over:
> >
> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
> 
> I am not going to seriously look at this for merging until after the
> merge window closes. 

OK. I'll wait.

> Have you thought about the possibility of relaxing the permission checks
> to mount proc such that we don't need to verify there is an existing
> mount of proc?  With just the subset pids I think this is feasible.  It
> might not be worth it at this point, but it is definitely worth asking
> the question.  As one of the benefits early propopents of the idea of a
> subset of proc touted was that they would not be as restricted as they
> are with today's proc.

I'm not sure I follow.

What do you mean by the possibility of relaxing the permission checks to
mount proc?

Do you suggest to allow a user to mount procfs with hidepid=2,subset=pid
options? If so then this is an interesting idea.

> I ask because this has a bearing on the other options you are playing
> with.

I can not agree with this because I do not touch on other options.
The hidepid and subset=pid has no relation to the visibility of regular
files. On the other hand, in procfs there is absolutely no way to restrict
access other than selinux.

> Do we want to find a way to have the benefit of relaxed permission
> checks while still including a few more files.

In fact, I see no problem allowing the user to mount procfs with the
hidepid=2,subset=pid options.

We can make subset=self, which would allow not only pids subset but also
other symlinks that lead to self (/proc/net, /proc/mounts) and if we ever
add virtualization to meminfo, cpuinfo etc.

> > Overview
> > --------
> > Directories and files can be created and deleted by dynamically loaded modules.
> > Not all of these files are virtualized and safe inside the container.
> >
> > However, subset=pid is not enough because many containers wants to have
> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
> > files per procfs mountpoint.
> 
> Is it desirable to have meminfo and cpuinfo as they are today or do
> people want them to reflect the ``container'' context.   So that
> applications like the JVM don't allocation too many cpus or don't try
> and consume too much memory, or run on nodes that cgroups current make
> unavailable.

Of course, it would be better if these files took into account the
limitations of cgroups or some kind of ``containerized'' context.

> Are there any users or planned users of this functionality yet?

I know that java uses meminfo for sure.

The purpose of this patch is to isolate the container from unwanted files
in procfs.

> I am concerned that you might be adding functionality that no one will
> ever use that will just add code to the kernel that no one cares about,
> that will then accumulate bugs.  Having had to work through a few of
> those cases to make each mount of proc have it's own super block I am
> not a great fan of adding another one.
>
> If the runc, lxc and other container runtime folks can productively use
> such and option to do useful things and they are sensible things to do I
> don't have any fundamental objection.  But I do want to be certain this
> is a feature that is going to be used.

Ok, just an example how docker or runc (actually almost all golang-based
container systems) is trying to block access to something in procfs:

$ docker run -it --rm busybox
# mount |grep /proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

For now I'm just trying ti create a better way to restrict access in
the procfs than this since procfs is used in containers.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 21:32   ` Christian Brauner
@ 2020-06-05  0:28     ` Alexey Gladkov
  0 siblings, 0 replies; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-05  0:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Kees Cook, Linux Containers, LKML,
	Alexander Viro, Andy Lutomirski, Linux FS Devel, Alexey Dobriyan,
	Stéphane Graber

On Thu, Jun 04, 2020 at 11:32:20PM +0200, Christian Brauner wrote:
> > Is it desirable to have meminfo and cpuinfo as they are today or do
> > people want them to reflect the ``container'' context.   So that
> > applications like the JVM don't allocation too many cpus or don't try
> > and consume too much memory, or run on nodes that cgroups current make
> > unavailable.
> > 
> > Are there any users or planned users of this functionality yet?
> > 
> > I am concerned that you might be adding functionality that no one will
> > ever use that will just add code to the kernel that no one cares about,
> > that will then accumulate bugs.  Having had to work through a few of
> > those cases to make each mount of proc have it's own super block I am
> > not a great fan of adding another one.
> > 
> > If the runc, lxc and other container runtime folks can productively use
> > such and option to do useful things and they are sensible things to do I
> > don't have any fundamental objection.  But I do want to be certain this
> > is a feature that is going to be used.
> 
> I'm not sure Alexey is introducing virtualized meminfo and cpuinfo (but
> I haven't had time to look at this patchset).

No. Not yet :) I just suggest a way to restrict access to files in the
procfs inside a container about which you know nothing.

> In any case, we are currently virtualizing:
> /proc/cpuinfo
> /proc/diskstats
> /proc/loadavg
> /proc/meminfo
> /proc/stat
> /proc/swaps
> /proc/uptime
> for each container with a tiny in-userspace filesystem LXCFS
> ( https://github.com/lxc/lxcfs )
> and have been doing that for years.

I know about it. The reason for the appearance of such a solution is also
clear.

> Having meminfo and cpuinfo virtualized in procfs was something we have
> been wanting for a long time and there have been patches by other people
> (from Siteground, I believe) to achieve this a few years back but were
> disregarded.
> 
> I think meminfo and cpuinfo would already be great. And if we're
> virtualizing cpuinfo we also need to virtualize the cpu bits exposed in
> /proc/stat. It would also be great to virtualize /proc/uptime. Right now
> we're achieving this essentially by substracting the time the init
> process of the pid namespace has started since system boot time, minus
> the time when the system started to get the actual reaper age (It's a
> bit more involved but that's the gist.).
> 
> This is all on the topic list for this year's virtual container's
> microconference at Plumber's and I would suggest we try to discuss the
> various requirements for something like this there. (I'm about to send
> the CFP out.)
> 
> Christian
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] proc: use subset option to hide some top-level procfs entries
  2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
@ 2020-06-05  2:19   ` kernel test robot
  2020-06-05  2:19   ` [PATCH] proc: fix boolreturn.cocci warnings kernel test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2020-06-05  2:19 UTC (permalink / raw)
  To: Alexey Gladkov, LKML
  Cc: kbuild-all, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Alexey Gladkov, Andy Lutomirski, Eric W . Biederman, Kees Cook

[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]

Hi Alexey,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20200604]
[cannot apply to kees/for-next/pstore linus/master linux/master v5.7 v5.7-rc7 v5.7-rc6 v5.7]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Alexey-Gladkov/proc-use-subset-option-to-hide-some-top-level-procfs-entries/20200605-040818
base:    d4899e5542c15062cc55cac0ca99025bb64edc61
config: arm64-randconfig-c003-20200604 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


coccinelle warnings: (new ones prefixed by >>)

>> fs/proc/generic.c:204:9-10: WARNING: return of 0/1 in function 'is_pde_visible' with return type bool

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35590 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] proc: fix boolreturn.cocci warnings
  2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
  2020-06-05  2:19   ` kernel test robot
@ 2020-06-05  2:19   ` kernel test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2020-06-05  2:19 UTC (permalink / raw)
  To: Alexey Gladkov, LKML
  Cc: kbuild-all, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Alexey Gladkov, Andy Lutomirski, Eric W . Biederman, Kees Cook

From: kernel test robot <lkp@intel.com>

fs/proc/generic.c:204:9-10: WARNING: return of 0/1 in function 'is_pde_visible' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

CC: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---

url:    https://github.com/0day-ci/linux/commits/Alexey-Gladkov/proc-use-subset-option-to-hide-some-top-level-procfs-entries/20200605-040818
base:    d4899e5542c15062cc55cac0ca99025bb64edc61

 generic.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -201,7 +201,7 @@ bool is_pde_visible(struct proc_fs_info
 	int i;
 
 	if (!fs_info->whitelist || pde == &proc_root)
-		return 1;
+		return true;
 
 	read_lock(&proc_subdir_lock);
 
@@ -218,7 +218,7 @@ bool is_pde_visible(struct proc_fs_info
 
 		if (ent == pde) {
 			read_unlock(&proc_subdir_lock);
-			return 1;
+			return true;
 		}
 
 		if (!S_ISDIR(ent->mode))
@@ -228,14 +228,14 @@ bool is_pde_visible(struct proc_fs_info
 		while (de != &proc_root) {
 			if (ent == de) {
 				read_unlock(&proc_subdir_lock);
-				return 1;
+				return true;
 			}
 			de = de->parent;
 		}
 	}
 
 	read_unlock(&proc_subdir_lock);
-	return 0;
+	return false;
 }
 
 static DEFINE_IDA(proc_inum_ida);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-05  0:08   ` Alexey Gladkov
@ 2020-06-05  4:17     ` Eric W. Biederman
  2020-06-05 14:47       ` Alexey Gladkov
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2020-06-05  4:17 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Andy Lutomirski, Kees Cook, Linux Containers

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> On Thu, Jun 04, 2020 at 03:33:25PM -0500, Eric W. Biederman wrote:
>> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>> 
>> > Greetings!
>> >
>> > Preface
>> > -------
>> > This patch set can be applied over:
>> >
>> > git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git d35bec8a5788
>> 
>> I am not going to seriously look at this for merging until after the
>> merge window closes. 
>
> OK. I'll wait.

That will mean your patches can be based on -rc1.

>> Have you thought about the possibility of relaxing the permission checks
>> to mount proc such that we don't need to verify there is an existing
>> mount of proc?  With just the subset pids I think this is feasible.  It
>> might not be worth it at this point, but it is definitely worth asking
>> the question.  As one of the benefits early propopents of the idea of a
>> subset of proc touted was that they would not be as restricted as they
>> are with today's proc.
>
> I'm not sure I follow.
>
> What do you mean by the possibility of relaxing the permission checks to
> mount proc?
>
> Do you suggest to allow a user to mount procfs with hidepid=2,subset=pid
> options? If so then this is an interesting idea.

The key part would be subset=pid.  You would still need to be root in
your user namespace, and mount namespace.  You would not need to have a
separate copy of proc with nothing hidden already mounted.

>> I ask because this has a bearing on the other options you are playing
>> with.
>
> I can not agree with this because I do not touch on other options.
> The hidepid and subset=pid has no relation to the visibility of regular
> files. On the other hand, in procfs there is absolutely no way to restrict
> access other than selinux.

Untrue.  At a practical level the user namespace greatly restricts
access to proc because many of the non-process files are limited to
global root only.

>> Do we want to find a way to have the benefit of relaxed permission
>> checks while still including a few more files.
>
> In fact, I see no problem allowing the user to mount procfs with the
> hidepid=2,subset=pid options.
>
> We can make subset=self, which would allow not only pids subset but also
> other symlinks that lead to self (/proc/net, /proc/mounts) and if we ever
> add virtualization to meminfo, cpuinfo etc.
>
>> > Overview
>> > --------
>> > Directories and files can be created and deleted by dynamically loaded modules.
>> > Not all of these files are virtualized and safe inside the container.
>> >
>> > However, subset=pid is not enough because many containers wants to have
>> > /proc/meminfo, /proc/cpuinfo, etc. We need a way to limit the visibility of
>> > files per procfs mountpoint.
>> 
>> Is it desirable to have meminfo and cpuinfo as they are today or do
>> people want them to reflect the ``container'' context.   So that
>> applications like the JVM don't allocation too many cpus or don't try
>> and consume too much memory, or run on nodes that cgroups current make
>> unavailable.
>
> Of course, it would be better if these files took into account the
> limitations of cgroups or some kind of ``containerized'' context.
>
>> Are there any users or planned users of this functionality yet?
>
> I know that java uses meminfo for sure.
>
> The purpose of this patch is to isolate the container from unwanted files
> in procfs.

If what we want is the ability not to use the original but to have
a modified version of these files.  We probably want empty files that
serve as mount points.

Or possibly a version of these files that takes into account
restrictions.  In either even we need to do the research through real
programs and real kernel options to see what is our best option for
exporting the limitations that programs have and deciding on the long
term API for that.

If we research things and we decide the best way to let java know of
it's limitations is to change /proc/meminfo.  That needs to be a change
that always applies to meminfo and is not controlled by options.

>> I am concerned that you might be adding functionality that no one will
>> ever use that will just add code to the kernel that no one cares about,
>> that will then accumulate bugs.  Having had to work through a few of
>> those cases to make each mount of proc have it's own super block I am
>> not a great fan of adding another one.
>>
>> If the runc, lxc and other container runtime folks can productively use
>> such and option to do useful things and they are sensible things to do I
>> don't have any fundamental objection.  But I do want to be certain this
>> is a feature that is going to be used.
>
> Ok, just an example how docker or runc (actually almost all golang-based
> container systems) is trying to block access to something in procfs:
>
> $ docker run -it --rm busybox
> # mount |grep /proc
> proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
> proc on /proc/bus type proc (ro,relatime)
> proc on /proc/fs type proc (ro,relatime)
> proc on /proc/irq type proc (ro,relatime)
> proc on /proc/sys type proc (ro,relatime)
> proc on /proc/sysrq-trigger type proc (ro,relatime)
> tmpfs on /proc/asound type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
> tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
> tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
>
> For now I'm just trying ti create a better way to restrict access in
> the procfs than this since procfs is used in containers.

Docker historically has been crap about having a sensible policy.  The
problem is that Docker wanted to allow real root in a container and
somehow make it safe by blocking access to proc files and by dropping
capabilities.

Practically everything that Docker has done is much better and simpler by
restricting the processes to a user namespace, with a root user whose
uid is not the global root user.

Which is why I want us to make certain we are doing something that makes
sense, and is architecturally sound.

You have cleared the big hurdle and proc now has options that are
usable.   I really appreciate that.  I am not opposed to the general
direction you are going to find a way to make proc more usable.  I just
want our next step to be solid.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] proc: use subset option to hide some top-level procfs entries
  2020-06-05  4:17     ` Eric W. Biederman
@ 2020-06-05 14:47       ` Alexey Gladkov
  0 siblings, 0 replies; 11+ messages in thread
From: Alexey Gladkov @ 2020-06-05 14:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexander Viro, Alexey Dobriyan,
	Andy Lutomirski, Kees Cook, Linux Containers

On Thu, Jun 04, 2020 at 11:17:38PM -0500, Eric W. Biederman wrote:
> >> I am not going to seriously look at this for merging until after the
> >> merge window closes. 
> >
> > OK. I'll wait.
> 
> That will mean your patches can be based on -rc1.

OK.

> > Do you suggest to allow a user to mount procfs with hidepid=2,subset=pid
> > options? If so then this is an interesting idea.
> 
> The key part would be subset=pid.  You would still need to be root in
> your user namespace, and mount namespace.  You would not need to have a
> separate copy of proc with nothing hidden already mounted.

Can you tell me more about your idea ? I thought I understood it, but it
seems my understanding is different.

I thought that you are suggesting that you move in the direction of
allowing procfs to mount an unprivileged user.

> > I can not agree with this because I do not touch on other options.
> > The hidepid and subset=pid has no relation to the visibility of regular
> > files. On the other hand, in procfs there is absolutely no way to restrict
> > access other than selinux.
> 
> Untrue.  At a practical level the user namespace greatly restricts
> access to proc because many of the non-process files are limited to
> global root only.

I am not worried about the files created in procfs by the kernel itself
because the permissions are set correctly and are checked correctly.

I worry about kernel modules, especially about modules out of tree.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/gadget/function/rndis.c#n904

I certainly understand that 0660 is not 0666, but still.

> > I know that java uses meminfo for sure.
> >
> > The purpose of this patch is to isolate the container from unwanted files
> > in procfs.
> 
> If what we want is the ability not to use the original but to have
> a modified version of these files.  We probably want empty files that
> serve as mount points.
> 
> Or possibly a version of these files that takes into account
> restrictions.  In either even we need to do the research through real
> programs and real kernel options to see what is our best option for
> exporting the limitations that programs have and deciding on the long
> term API for that.

Yes, but that's a slightly different story. It would be great if all of
these files provide modified information.

My patch is about those files that we don’t know about and which we don’t
want.

> If we research things and we decide the best way to let java know of
> it's limitations is to change /proc/meminfo.  That needs to be a change
> that always applies to meminfo and is not controlled by options.
> 
> > For now I'm just trying ti create a better way to restrict access in
> > the procfs than this since procfs is used in containers.
> 
> Docker historically has been crap about having a sensible policy.  The
> problem is that Docker wanted to allow real root in a container and
> somehow make it safe by blocking access to proc files and by dropping
> capabilities.
> 
> Practically everything that Docker has done is much better and simpler by
> restricting the processes to a user namespace, with a root user whose
> uid is not the global root user.
> 
> Which is why I want us to make certain we are doing something that makes
> sense, and is architecturally sound.

Ok. Then ignore this patchset.

> You have cleared the big hurdle and proc now has options that are
> usable.   I really appreciate that.  I am not opposed to the general
> direction you are going to find a way to make proc more usable.  I just
> want our next step to be solid.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-06-05 14:47 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-04 20:04 [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Alexey Gladkov
2020-06-04 20:04 ` [PATCH 1/2] " Alexey Gladkov
2020-06-05  2:19   ` kernel test robot
2020-06-05  2:19   ` [PATCH] proc: fix boolreturn.cocci warnings kernel test robot
2020-06-04 20:04 ` [PATCH 2/2] docs: proc: update documentation about subset= parameter Alexey Gladkov
2020-06-04 20:33 ` [PATCH 0/2] proc: use subset option to hide some top-level procfs entries Eric W. Biederman
2020-06-04 21:32   ` Christian Brauner
2020-06-05  0:28     ` Alexey Gladkov
2020-06-05  0:08   ` Alexey Gladkov
2020-06-05  4:17     ` Eric W. Biederman
2020-06-05 14:47       ` Alexey Gladkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).