All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Add pidfs filesystem
@ 2017-02-18 22:53 Alexey Gladkov
  2017-02-18 23:34 ` kbuild test robot
                   ` (4 more replies)
  0 siblings, 5 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-02-18 22:53 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov

The pidfs filesystem contains a subset of the /proc file system which
contains only information about the processes.

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an addon to /proc, flags applied to
/proc have an effect on this pidfs filesystem.

Why not implement it as another flag to /proc ?

The /proc flags is stored in the pid_namespace and are global for
namespace. It means that if you add a flag to hide all except the pids,
then it will act on all mounted instances of /proc.

Originally the idea was that the container will be mounted only pidfs
and additional required files will be mounted on top using the
overlayfs. But I found out that /proc does not support overlayfs and
does not allow to mount anything on top or under it.

My question is whether it's possible to add overlayfs support for /proc?

Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 Documentation/filesystems/pidfs.txt | 16 ++++++++
 fs/proc/Kconfig                     |  8 ++++
 fs/proc/inode.c                     |  8 +++-
 fs/proc/internal.h                  |  2 +
 fs/proc/root.c                      | 76 ++++++++++++++++++++++++++++++++++---
 fs/proc/self.c                      |  6 +++
 fs/proc/thread_self.c               |  6 +++
 include/linux/pid_namespace.h       |  5 +++
 8 files changed, 119 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/filesystems/pidfs.txt

diff --git a/Documentation/filesystems/pidfs.txt b/Documentation/filesystems/pidfs.txt
new file mode 100644
index 0000000..ce958a5
--- /dev/null
+++ b/Documentation/filesystems/pidfs.txt
@@ -0,0 +1,16 @@
+The PIDFS Filesystem
+====================
+
+The pidfs filesystem contains a subset of the /proc file system which contains
+only information about the processes. The link self points to the process
+reading the file system. All other special files and directories in /proc are
+not available in this filesystem.
+
+The pidfs is not an independent filesystem, its implementation shares code
+with /proc.
+
+All mount options applicable to /proc filesystem are also applicable
+to pidfs filesystem. For example, access to the information in /proc/[pid]
+directories can be restricted using hidepid option.
+
+To get more information about the processes read the proc.txt
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade120..fa568f6 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,14 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_PIDFS
+	bool "pidfs file system support"
+	depends on PROC_FS
+	default n
+	help
+	  The pidfs filesystem contains a subset of the /proc file system
+	  which contains only information only about the processes.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..1be65b4 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -474,12 +474,16 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
 	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct proc_dir_entry *fs_root = &proc_root;
 	struct inode *root_inode;
 	int ret;
 
 	if (!proc_parse_options(data, ns))
 		return -EINVAL;
 
+	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type)
+		fs_root = &pidfs_root;
+
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
 	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
@@ -496,8 +500,8 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
 	
-	pde_get(&proc_root);
-	root_inode = proc_get_inode(s, &proc_root);
+	pde_get(fs_root);
+	root_inode = proc_get_inode(s, fs_root);
 	if (!root_inode) {
 		pr_err("proc_fill_super: get root inode failed\n");
 		return -ENOMEM;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194..a7c068c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -267,6 +267,8 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+extern struct file_system_type pidfs_fs_type;
+extern struct proc_dir_entry pidfs_root;
 extern struct proc_dir_entry proc_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4bd0373..de16ac1 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -102,10 +102,21 @@ static void proc_kill_sb(struct super_block *sb)
 	struct pid_namespace *ns;
 
 	ns = (struct pid_namespace *)sb->s_fs_info;
-	if (ns->proc_self)
-		dput(ns->proc_self);
-	if (ns->proc_thread_self)
-		dput(ns->proc_thread_self);
+
+	if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
+		if (ns->pidfs_self)
+			dput(ns->pidfs_self);
+
+		if (ns->pidfs_thread_self)
+			dput(ns->pidfs_thread_self);
+	} else {
+		if (ns->proc_self)
+			dput(ns->proc_self);
+
+		if (ns->proc_thread_self)
+			dput(ns->proc_thread_self);
+	}
+
 	kill_anon_super(sb);
 	put_pid_ns(ns);
 }
@@ -117,6 +128,13 @@ static struct file_system_type proc_fs_type = {
 	.fs_flags	= FS_USERNS_MOUNT,
 };
 
+struct file_system_type pidfs_fs_type = {
+	.name		= "pidfs",
+	.mount		= proc_mount,
+	.kill_sb	= proc_kill_sb,
+	.fs_flags	= FS_USERNS_MOUNT,
+};
+
 void __init proc_root_init(void)
 {
 	int err;
@@ -127,6 +145,10 @@ void __init proc_root_init(void)
 	if (err)
 		return;
 
+	err = register_filesystem(&pidfs_fs_type);
+	if (err)
+		return;
+
 	proc_self_init();
 	proc_thread_self_init();
 	proc_symlink("mounts", NULL, "self/mounts");
@@ -148,8 +170,7 @@ void __init proc_root_init(void)
 	proc_sys_init();
 }
 
-static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat
-)
+static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
 	generic_fillattr(d_inode(dentry), stat);
 	stat->nlink = proc_root.nlink + nr_processes();
@@ -176,6 +197,14 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
 
+static int pidfs_root_readdir(struct file *file, struct dir_context *ctx)
+{
+	if (ctx->pos < FIRST_PROCESS_ENTRY)
+		ctx->pos = FIRST_PROCESS_ENTRY;
+
+	return proc_pid_readdir(file, ctx);
+}
+
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
@@ -187,6 +216,12 @@ static const struct file_operations proc_root_operations = {
 	.llseek		= generic_file_llseek,
 };
 
+static const struct file_operations pidfs_root_operations = {
+	.read		 = generic_read_dir,
+	.iterate_shared	 = pidfs_root_readdir,
+	.llseek		= generic_file_llseek,
+};
+
 /*
  * proc root can do almost nothing..
  */
@@ -195,6 +230,11 @@ static const struct inode_operations proc_root_inode_operations = {
 	.getattr	= proc_root_getattr,
 };
 
+static const struct inode_operations pidfs_root_inode_operations = {
+	.lookup		= proc_pid_lookup,
+	.getattr	= proc_root_getattr,
+};
+
 /*
  * This is the root "inode" in the /proc tree..
  */
@@ -211,6 +251,19 @@ struct proc_dir_entry proc_root = {
 	.name		= "/proc",
 };
 
+struct proc_dir_entry pidfs_root = {
+	.low_ino	= PROC_ROOT_INO,
+	.namelen	= 6,
+	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
+	.nlink		= 2,
+	.count		= ATOMIC_INIT(1),
+	.proc_iops	= &pidfs_root_inode_operations,
+	.proc_fops	= &pidfs_root_operations,
+	.parent		= &pidfs_root,
+	.subdir		= RB_ROOT,
+	.name		= "/pidfs",
+};
+
 int pid_ns_prepare_proc(struct pid_namespace *ns)
 {
 	struct vfsmount *mnt;
@@ -220,10 +273,21 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
 		return PTR_ERR(mnt);
 
 	ns->proc_mnt = mnt;
+
+	if (IS_ENABLED(CONFIG_PROC_PIDFS)) {
+		mnt = kern_mount_data(&pidfs_fs_type, ns);
+		if (IS_ERR(mnt))
+			return PTR_ERR(mnt);
+
+		ns->pidfs_mnt = mnt;
+	}
 	return 0;
 }
 
 void pid_ns_release_proc(struct pid_namespace *ns)
 {
 	kern_unmount(ns->proc_mnt);
+
+	if (IS_ENABLED(CONFIG_PROC_PIDFS))
+		kern_unmount(ns->pidfs_mnt);
 }
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 4024595..dea7e17 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -74,6 +74,12 @@ int proc_setup_self(struct super_block *s)
 		pr_err("proc_fill_super: can't allocate /proc/self\n");
 		return PTR_ERR(self);
 	}
+
+	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
+		ns->pidfs_self = self;
+		return 0;
+	}
+
 	ns->proc_self = self;
 	return 0;
 }
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 595b90a97..274c618 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -76,6 +76,12 @@ int proc_setup_thread_self(struct super_block *s)
 		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
 		return PTR_ERR(thread_self);
 	}
+
+	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
+		ns->pidfs_thread_self = thread_self;
+		return 0;
+	}
+
 	ns->proc_thread_self = thread_self;
 	return 0;
 }
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 34cce96..fca3a76 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -46,6 +46,11 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+#ifdef CONFIG_PROC_PIDFS
+	struct vfsmount *pidfs_mnt;
+	struct dentry *pidfs_self;
+	struct dentry *pidfs_thread_self;
+#endif
 };
 
 extern struct pid_namespace init_pid_ns;
-- 
2.10.2


-- 
Rgrds, legion

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
@ 2017-02-18 23:34 ` kbuild test robot
  2017-02-18 23:34 ` kbuild test robot
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: kbuild test robot @ 2017-02-18 23:34 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: kbuild-all, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 8180 bytes --]

Hi Alexey,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc8 next-20170217]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Add-pidfs-filesystem/20170219-070129
config: x86_64-randconfig-x003-201708 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/uapi/linux/capability.h:16,
                    from include/linux/capability.h:15,
                    from include/linux/sched.h:15,
                    from include/linux/uaccess.h:4,
                    from fs/proc/root.c:9:
   fs/proc/root.c: In function 'proc_kill_sb':
   fs/proc/root.c:107:9: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
      if (ns->pidfs_self)
            ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> fs/proc/root.c:107:3: note: in expansion of macro 'if'
      if (ns->pidfs_self)
      ^~
   fs/proc/root.c:107:9: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
      if (ns->pidfs_self)
            ^
   include/linux/compiler.h:149:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> fs/proc/root.c:107:3: note: in expansion of macro 'if'
      if (ns->pidfs_self)
      ^~
   fs/proc/root.c:107:9: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
      if (ns->pidfs_self)
            ^
   include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> fs/proc/root.c:107:3: note: in expansion of macro 'if'
      if (ns->pidfs_self)
      ^~
   fs/proc/root.c:108:11: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
       dput(ns->pidfs_self);
              ^~
   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/uapi/linux/capability.h:16,
                    from include/linux/capability.h:15,
                    from include/linux/sched.h:15,
                    from include/linux/uaccess.h:4,
                    from fs/proc/root.c:9:
   fs/proc/root.c:110:9: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
      if (ns->pidfs_thread_self)
            ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
   fs/proc/root.c:110:3: note: in expansion of macro 'if'
      if (ns->pidfs_thread_self)
      ^~
   fs/proc/root.c:110:9: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
      if (ns->pidfs_thread_self)
            ^
   include/linux/compiler.h:149:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
   fs/proc/root.c:110:3: note: in expansion of macro 'if'
      if (ns->pidfs_thread_self)
      ^~
   fs/proc/root.c:110:9: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
      if (ns->pidfs_thread_self)
            ^
   include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
   fs/proc/root.c:110:3: note: in expansion of macro 'if'
      if (ns->pidfs_thread_self)
      ^~
   fs/proc/root.c:111:11: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
       dput(ns->pidfs_thread_self);
              ^~
   fs/proc/root.c: In function 'pid_ns_prepare_proc':
   fs/proc/root.c:282:5: error: 'struct pid_namespace' has no member named 'pidfs_mnt'; did you mean 'proc_mnt'?
      ns->pidfs_mnt = mnt;
        ^~
   fs/proc/root.c: In function 'pid_ns_release_proc':
   fs/proc/root.c:292:18: error: 'struct pid_namespace' has no member named 'pidfs_mnt'; did you mean 'proc_mnt'?
      kern_unmount(ns->pidfs_mnt);
                     ^~

vim +/if +107 fs/proc/root.c

     3	 *
     4	 *  Copyright (C) 1991, 1992 Linus Torvalds
     5	 *
     6	 *  proc root directory handling functions
     7	 */
     8	
   > 9	#include <linux/uaccess.h>
    10	
    11	#include <linux/errno.h>
    12	#include <linux/time.h>
    13	#include <linux/proc_fs.h>
    14	#include <linux/stat.h>
    15	#include <linux/init.h>
    16	#include <linux/sched.h>
    17	#include <linux/module.h>
    18	#include <linux/bitops.h>
    19	#include <linux/user_namespace.h>
    20	#include <linux/mount.h>
    21	#include <linux/pid_namespace.h>
    22	#include <linux/parser.h>
    23	
    24	#include "internal.h"
    25	
    26	enum {
    27		Opt_gid, Opt_hidepid, Opt_err,
    28	};
    29	
    30	static const match_table_t tokens = {
    31		{Opt_hidepid, "hidepid=%u"},
    32		{Opt_gid, "gid=%u"},
    33		{Opt_err, NULL},
    34	};
    35	
    36	int proc_parse_options(char *options, struct pid_namespace *pid)
    37	{
    38		char *p;
    39		substring_t args[MAX_OPT_ARGS];
    40		int option;
    41	
    42		if (!options)
    43			return 1;
    44	
    45		while ((p = strsep(&options, ",")) != NULL) {
    46			int token;
    47			if (!*p)
    48				continue;
    49	
    50			args[0].to = args[0].from = NULL;
    51			token = match_token(p, tokens, args);
    52			switch (token) {
    53			case Opt_gid:
    54				if (match_int(&args[0], &option))
    55					return 0;
    56				pid->pid_gid = make_kgid(current_user_ns(), option);
    57				break;
    58			case Opt_hidepid:
    59				if (match_int(&args[0], &option))
    60					return 0;
    61				if (option < 0 || option > 2) {
    62					pr_err("proc: hidepid value must be between 0 and 2.\n");
    63					return 0;
    64				}
    65				pid->hide_pid = option;
    66				break;
    67			default:
    68				pr_err("proc: unrecognized mount option \"%s\" "
    69				       "or missing value\n", p);
    70				return 0;
    71			}
    72		}
    73	
    74		return 1;
    75	}
    76	
    77	int proc_remount(struct super_block *sb, int *flags, char *data)
    78	{
    79		struct pid_namespace *pid = sb->s_fs_info;
    80	
    81		sync_filesystem(sb);
    82		return !proc_parse_options(data, pid);
    83	}
    84	
    85	static struct dentry *proc_mount(struct file_system_type *fs_type,
    86		int flags, const char *dev_name, void *data)
    87	{
    88		struct pid_namespace *ns;
    89	
    90		if (flags & MS_KERNMOUNT) {
    91			ns = data;
    92			data = NULL;
    93		} else {
    94			ns = task_active_pid_ns(current);
    95		}
    96	
    97		return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
    98	}
    99	
   100	static void proc_kill_sb(struct super_block *sb)
   101	{
   102		struct pid_namespace *ns;
   103	
   104		ns = (struct pid_namespace *)sb->s_fs_info;
   105	
   106		if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
 > 107			if (ns->pidfs_self)
   108				dput(ns->pidfs_self);
   109	
   110			if (ns->pidfs_thread_self)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23447 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
  2017-02-18 23:34 ` kbuild test robot
@ 2017-02-18 23:34 ` kbuild test robot
  2017-02-20  4:05 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 53+ messages in thread
From: kbuild test robot @ 2017-02-18 23:34 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: kbuild-all, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 7813 bytes --]

Hi Alexey,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc8 next-20170217]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexey-Gladkov/Add-pidfs-filesystem/20170219-070129
config: x86_64-randconfig-x001-201708 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/proc/root.c: In function 'proc_kill_sb':
>> fs/proc/root.c:107:9: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
      if (ns->pidfs_self)
            ^~
   fs/proc/root.c:108:11: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
       dput(ns->pidfs_self);
              ^~
>> fs/proc/root.c:110:9: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
      if (ns->pidfs_thread_self)
            ^~
   fs/proc/root.c:111:11: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
       dput(ns->pidfs_thread_self);
              ^~
   fs/proc/root.c: In function 'pid_ns_prepare_proc':
>> fs/proc/root.c:282:5: error: 'struct pid_namespace' has no member named 'pidfs_mnt'; did you mean 'proc_mnt'?
      ns->pidfs_mnt = mnt;
        ^~
   fs/proc/root.c: In function 'pid_ns_release_proc':
   fs/proc/root.c:292:18: error: 'struct pid_namespace' has no member named 'pidfs_mnt'; did you mean 'proc_mnt'?
      kern_unmount(ns->pidfs_mnt);
                     ^~
--
   fs/proc/self.c: In function 'proc_setup_self':
>> fs/proc/self.c:66:5: error: 'struct pid_namespace' has no member named 'pidfs_self'; did you mean 'proc_self'?
      ns->pidfs_self = self;
        ^~
--
   fs/proc/thread_self.c: In function 'proc_setup_thread_self':
>> fs/proc/thread_self.c:67:5: error: 'struct pid_namespace' has no member named 'pidfs_thread_self'; did you mean 'proc_thread_self'?
      ns->pidfs_thread_self = thread_self;
        ^~

vim +107 fs/proc/root.c

   101	{
   102		struct pid_namespace *ns;
   103	
   104		ns = (struct pid_namespace *)sb->s_fs_info;
   105	
   106		if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
 > 107			if (ns->pidfs_self)
   108				dput(ns->pidfs_self);
   109	
 > 110			if (ns->pidfs_thread_self)
   111				dput(ns->pidfs_thread_self);
   112		} else {
   113			if (ns->proc_self)
   114				dput(ns->proc_self);
   115	
   116			if (ns->proc_thread_self)
   117				dput(ns->proc_thread_self);
   118		}
   119	
   120		kill_anon_super(sb);
   121		put_pid_ns(ns);
   122	}
   123	
   124	static struct file_system_type proc_fs_type = {
   125		.name		= "proc",
   126		.mount		= proc_mount,
   127		.kill_sb	= proc_kill_sb,
   128		.fs_flags	= FS_USERNS_MOUNT,
   129	};
   130	
   131	struct file_system_type pidfs_fs_type = {
   132		.name		= "pidfs",
   133		.mount		= proc_mount,
   134		.kill_sb	= proc_kill_sb,
   135		.fs_flags	= FS_USERNS_MOUNT,
   136	};
   137	
   138	void __init proc_root_init(void)
   139	{
   140		int err;
   141	
   142		proc_init_inodecache();
   143		set_proc_pid_nlink();
   144		err = register_filesystem(&proc_fs_type);
   145		if (err)
   146			return;
   147	
   148		err = register_filesystem(&pidfs_fs_type);
   149		if (err)
   150			return;
   151	
   152		proc_self_init();
   153		proc_thread_self_init();
   154		proc_symlink("mounts", NULL, "self/mounts");
   155	
   156		proc_net_init();
   157	
   158	#ifdef CONFIG_SYSVIPC
   159		proc_mkdir("sysvipc", NULL);
   160	#endif
   161		proc_mkdir("fs", NULL);
   162		proc_mkdir("driver", NULL);
   163		proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
   164	#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
   165		/* just give it a mountpoint */
   166		proc_create_mount_point("openprom");
   167	#endif
   168		proc_tty_init();
   169		proc_mkdir("bus", NULL);
   170		proc_sys_init();
   171	}
   172	
   173	static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
   174	{
   175		generic_fillattr(d_inode(dentry), stat);
   176		stat->nlink = proc_root.nlink + nr_processes();
   177		return 0;
   178	}
   179	
   180	static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
   181	{
   182		if (!proc_pid_lookup(dir, dentry, flags))
   183			return NULL;
   184		
   185		return proc_lookup(dir, dentry, flags);
   186	}
   187	
   188	static int proc_root_readdir(struct file *file, struct dir_context *ctx)
   189	{
   190		if (ctx->pos < FIRST_PROCESS_ENTRY) {
   191			int error = proc_readdir(file, ctx);
   192			if (unlikely(error <= 0))
   193				return error;
   194			ctx->pos = FIRST_PROCESS_ENTRY;
   195		}
   196	
   197		return proc_pid_readdir(file, ctx);
   198	}
   199	
   200	static int pidfs_root_readdir(struct file *file, struct dir_context *ctx)
   201	{
   202		if (ctx->pos < FIRST_PROCESS_ENTRY)
   203			ctx->pos = FIRST_PROCESS_ENTRY;
   204	
   205		return proc_pid_readdir(file, ctx);
   206	}
   207	
   208	/*
   209	 * The root /proc directory is special, as it has the
   210	 * <pid> directories. Thus we don't use the generic
   211	 * directory handling functions for that..
   212	 */
   213	static const struct file_operations proc_root_operations = {
   214		.read		 = generic_read_dir,
   215		.iterate_shared	 = proc_root_readdir,
   216		.llseek		= generic_file_llseek,
   217	};
   218	
   219	static const struct file_operations pidfs_root_operations = {
   220		.read		 = generic_read_dir,
   221		.iterate_shared	 = pidfs_root_readdir,
   222		.llseek		= generic_file_llseek,
   223	};
   224	
   225	/*
   226	 * proc root can do almost nothing..
   227	 */
   228	static const struct inode_operations proc_root_inode_operations = {
   229		.lookup		= proc_root_lookup,
   230		.getattr	= proc_root_getattr,
   231	};
   232	
   233	static const struct inode_operations pidfs_root_inode_operations = {
   234		.lookup		= proc_pid_lookup,
   235		.getattr	= proc_root_getattr,
   236	};
   237	
   238	/*
   239	 * This is the root "inode" in the /proc tree..
   240	 */
   241	struct proc_dir_entry proc_root = {
   242		.low_ino	= PROC_ROOT_INO, 
   243		.namelen	= 5, 
   244		.mode		= S_IFDIR | S_IRUGO | S_IXUGO, 
   245		.nlink		= 2, 
   246		.count		= ATOMIC_INIT(1),
   247		.proc_iops	= &proc_root_inode_operations, 
   248		.proc_fops	= &proc_root_operations,
   249		.parent		= &proc_root,
   250		.subdir		= RB_ROOT,
   251		.name		= "/proc",
   252	};
   253	
   254	struct proc_dir_entry pidfs_root = {
   255		.low_ino	= PROC_ROOT_INO,
   256		.namelen	= 6,
   257		.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
   258		.nlink		= 2,
   259		.count		= ATOMIC_INIT(1),
   260		.proc_iops	= &pidfs_root_inode_operations,
   261		.proc_fops	= &pidfs_root_operations,
   262		.parent		= &pidfs_root,
   263		.subdir		= RB_ROOT,
   264		.name		= "/pidfs",
   265	};
   266	
   267	int pid_ns_prepare_proc(struct pid_namespace *ns)
   268	{
   269		struct vfsmount *mnt;
   270	
   271		mnt = kern_mount_data(&proc_fs_type, ns);
   272		if (IS_ERR(mnt))
   273			return PTR_ERR(mnt);
   274	
   275		ns->proc_mnt = mnt;
   276	
   277		if (IS_ENABLED(CONFIG_PROC_PIDFS)) {
   278			mnt = kern_mount_data(&pidfs_fs_type, ns);
   279			if (IS_ERR(mnt))
   280				return PTR_ERR(mnt);
   281	
 > 282			ns->pidfs_mnt = mnt;
   283		}
   284		return 0;
   285	}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25024 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
  2017-02-18 23:34 ` kbuild test robot
  2017-02-18 23:34 ` kbuild test robot
@ 2017-02-20  4:05 ` Eric W. Biederman
  2017-02-20 10:36   ` Alexey Gladkov
  2017-02-22 20:11   ` Richard Weinberger
  2017-02-21 14:57 ` Oleg Nesterov
  2017-02-27 18:56 ` [PATCH] Add pidfs filesystem Michael Kerrisk
  4 siblings, 2 replies; 53+ messages in thread
From: Eric W. Biederman @ 2017-02-20  4:05 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Oleg Nesterov, Pavel Emelyanov, Andy Lutomirski

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> The pidfs filesystem contains a subset of the /proc file system which
> contains only information about the processes.

My summary of your motivation.

It hurts when I create a container with a processes with uid 0 inside of
it.  This generates lots of hacks to attempt to limit uid 0.

  My answer:  Don't run a container with a real uid 0 inside of it.

Any reasonable permission check will on proc files
will keep you safe if your container does not have a real uid 0 in it.


That said I am not opposed in principle to a pidfs.  And the idea of
using overlay for this purpose is intriguing.  I have not looked at it
in enough detail comment on the technical merits.

Eric


> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids
> without rest objects. Since this is an addon to /proc, flags applied to
> /proc have an effect on this pidfs filesystem.
>
> Why not implement it as another flag to /proc ?
>
> The /proc flags is stored in the pid_namespace and are global for
> namespace. It means that if you add a flag to hide all except the pids,
> then it will act on all mounted instances of /proc.
>
> Originally the idea was that the container will be mounted only pidfs
> and additional required files will be mounted on top using the
> overlayfs. But I found out that /proc does not support overlayfs and
> does not allow to mount anything on top or under it.
>
> My question is whether it's possible to add overlayfs support for /proc?
>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---
>  Documentation/filesystems/pidfs.txt | 16 ++++++++
>  fs/proc/Kconfig                     |  8 ++++
>  fs/proc/inode.c                     |  8 +++-
>  fs/proc/internal.h                  |  2 +
>  fs/proc/root.c                      | 76 ++++++++++++++++++++++++++++++++++---
>  fs/proc/self.c                      |  6 +++
>  fs/proc/thread_self.c               |  6 +++
>  include/linux/pid_namespace.h       |  5 +++
>  8 files changed, 119 insertions(+), 8 deletions(-)
>  create mode 100644 Documentation/filesystems/pidfs.txt
>
> diff --git a/Documentation/filesystems/pidfs.txt b/Documentation/filesystems/pidfs.txt
> new file mode 100644
> index 0000000..ce958a5
> --- /dev/null
> +++ b/Documentation/filesystems/pidfs.txt
> @@ -0,0 +1,16 @@
> +The PIDFS Filesystem
> +====================
> +
> +The pidfs filesystem contains a subset of the /proc file system which contains
> +only information about the processes. The link self points to the process
> +reading the file system. All other special files and directories in /proc are
> +not available in this filesystem.
> +
> +The pidfs is not an independent filesystem, its implementation shares code
> +with /proc.
> +
> +All mount options applicable to /proc filesystem are also applicable
> +to pidfs filesystem. For example, access to the information in /proc/[pid]
> +directories can be restricted using hidepid option.
> +
> +To get more information about the processes read the proc.txt
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 1ade120..fa568f6 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -43,6 +43,14 @@ config PROC_VMCORE
>          help
>          Exports the dump image of crashed kernel in ELF format.
>  
> +config PROC_PIDFS
> +	bool "pidfs file system support"
> +	depends on PROC_FS
> +	default n
> +	help
> +	  The pidfs filesystem contains a subset of the /proc file system
> +	  which contains only information only about the processes.
> +
>  config PROC_SYSCTL
>  	bool "Sysctl support (/proc/sys)" if EXPERT
>  	depends on PROC_FS
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 783bc19..1be65b4 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -474,12 +474,16 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
>  int proc_fill_super(struct super_block *s, void *data, int silent)
>  {
>  	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
> +	struct proc_dir_entry *fs_root = &proc_root;
>  	struct inode *root_inode;
>  	int ret;
>  
>  	if (!proc_parse_options(data, ns))
>  		return -EINVAL;
>  
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type)
> +		fs_root = &pidfs_root;
> +
>  	/* User space would break if executables or devices appear on proc */
>  	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
>  	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
> @@ -496,8 +500,8 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
>  	 */
>  	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
>  	
> -	pde_get(&proc_root);
> -	root_inode = proc_get_inode(s, &proc_root);
> +	pde_get(fs_root);
> +	root_inode = proc_get_inode(s, fs_root);
>  	if (!root_inode) {
>  		pr_err("proc_fill_super: get root inode failed\n");
>  		return -ENOMEM;
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 2de5194..a7c068c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -267,6 +267,8 @@ static inline void proc_tty_init(void) {}
>  /*
>   * root.c
>   */
> +extern struct file_system_type pidfs_fs_type;
> +extern struct proc_dir_entry pidfs_root;
>  extern struct proc_dir_entry proc_root;
>  extern int proc_parse_options(char *options, struct pid_namespace *pid);
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 4bd0373..de16ac1 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -102,10 +102,21 @@ static void proc_kill_sb(struct super_block *sb)
>  	struct pid_namespace *ns;
>  
>  	ns = (struct pid_namespace *)sb->s_fs_info;
> -	if (ns->proc_self)
> -		dput(ns->proc_self);
> -	if (ns->proc_thread_self)
> -		dput(ns->proc_thread_self);
> +
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
> +		if (ns->pidfs_self)
> +			dput(ns->pidfs_self);
> +
> +		if (ns->pidfs_thread_self)
> +			dput(ns->pidfs_thread_self);
> +	} else {
> +		if (ns->proc_self)
> +			dput(ns->proc_self);
> +
> +		if (ns->proc_thread_self)
> +			dput(ns->proc_thread_self);
> +	}
> +
>  	kill_anon_super(sb);
>  	put_pid_ns(ns);
>  }
> @@ -117,6 +128,13 @@ static struct file_system_type proc_fs_type = {
>  	.fs_flags	= FS_USERNS_MOUNT,
>  };
>  
> +struct file_system_type pidfs_fs_type = {
> +	.name		= "pidfs",
> +	.mount		= proc_mount,
> +	.kill_sb	= proc_kill_sb,
> +	.fs_flags	= FS_USERNS_MOUNT,
> +};
> +
>  void __init proc_root_init(void)
>  {
>  	int err;
> @@ -127,6 +145,10 @@ void __init proc_root_init(void)
>  	if (err)
>  		return;
>  
> +	err = register_filesystem(&pidfs_fs_type);
> +	if (err)
> +		return;
> +
>  	proc_self_init();
>  	proc_thread_self_init();
>  	proc_symlink("mounts", NULL, "self/mounts");
> @@ -148,8 +170,7 @@ void __init proc_root_init(void)
>  	proc_sys_init();
>  }
>  
> -static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat
> -)
> +static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
>  {
>  	generic_fillattr(d_inode(dentry), stat);
>  	stat->nlink = proc_root.nlink + nr_processes();
> @@ -176,6 +197,14 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
>  	return proc_pid_readdir(file, ctx);
>  }
>  
> +static int pidfs_root_readdir(struct file *file, struct dir_context *ctx)
> +{
> +	if (ctx->pos < FIRST_PROCESS_ENTRY)
> +		ctx->pos = FIRST_PROCESS_ENTRY;
> +
> +	return proc_pid_readdir(file, ctx);
> +}
> +
>  /*
>   * The root /proc directory is special, as it has the
>   * <pid> directories. Thus we don't use the generic
> @@ -187,6 +216,12 @@ static const struct file_operations proc_root_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +static const struct file_operations pidfs_root_operations = {
> +	.read		 = generic_read_dir,
> +	.iterate_shared	 = pidfs_root_readdir,
> +	.llseek		= generic_file_llseek,
> +};
> +
>  /*
>   * proc root can do almost nothing..
>   */
> @@ -195,6 +230,11 @@ static const struct inode_operations proc_root_inode_operations = {
>  	.getattr	= proc_root_getattr,
>  };
>  
> +static const struct inode_operations pidfs_root_inode_operations = {
> +	.lookup		= proc_pid_lookup,
> +	.getattr	= proc_root_getattr,
> +};
> +
>  /*
>   * This is the root "inode" in the /proc tree..
>   */
> @@ -211,6 +251,19 @@ struct proc_dir_entry proc_root = {
>  	.name		= "/proc",
>  };
>  
> +struct proc_dir_entry pidfs_root = {
> +	.low_ino	= PROC_ROOT_INO,
> +	.namelen	= 6,
> +	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
> +	.nlink		= 2,
> +	.count		= ATOMIC_INIT(1),
> +	.proc_iops	= &pidfs_root_inode_operations,
> +	.proc_fops	= &pidfs_root_operations,
> +	.parent		= &pidfs_root,
> +	.subdir		= RB_ROOT,
> +	.name		= "/pidfs",
> +};
> +
>  int pid_ns_prepare_proc(struct pid_namespace *ns)
>  {
>  	struct vfsmount *mnt;
> @@ -220,10 +273,21 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
>  		return PTR_ERR(mnt);
>  
>  	ns->proc_mnt = mnt;
> +
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS)) {
> +		mnt = kern_mount_data(&pidfs_fs_type, ns);
> +		if (IS_ERR(mnt))
> +			return PTR_ERR(mnt);
> +
> +		ns->pidfs_mnt = mnt;
> +	}
>  	return 0;
>  }
>  
>  void pid_ns_release_proc(struct pid_namespace *ns)
>  {
>  	kern_unmount(ns->proc_mnt);
> +
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS))
> +		kern_unmount(ns->pidfs_mnt);
>  }
> diff --git a/fs/proc/self.c b/fs/proc/self.c
> index 4024595..dea7e17 100644
> --- a/fs/proc/self.c
> +++ b/fs/proc/self.c
> @@ -74,6 +74,12 @@ int proc_setup_self(struct super_block *s)
>  		pr_err("proc_fill_super: can't allocate /proc/self\n");
>  		return PTR_ERR(self);
>  	}
> +
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> +		ns->pidfs_self = self;
> +		return 0;
> +	}
> +
>  	ns->proc_self = self;
>  	return 0;
>  }
> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index 595b90a97..274c618 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -76,6 +76,12 @@ int proc_setup_thread_self(struct super_block *s)
>  		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
>  		return PTR_ERR(thread_self);
>  	}
> +
> +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> +		ns->pidfs_thread_self = thread_self;
> +		return 0;
> +	}
> +
>  	ns->proc_thread_self = thread_self;
>  	return 0;
>  }
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 34cce96..fca3a76 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -46,6 +46,11 @@ struct pid_namespace {
>  	int hide_pid;
>  	int reboot;	/* group exit code if this pidns was rebooted */
>  	struct ns_common ns;
> +#ifdef CONFIG_PROC_PIDFS
> +	struct vfsmount *pidfs_mnt;
> +	struct dentry *pidfs_self;
> +	struct dentry *pidfs_thread_self;
> +#endif
>  };
>  
>  extern struct pid_namespace init_pid_ns;
> -- 
> 2.10.2

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-20  4:05 ` Eric W. Biederman
@ 2017-02-20 10:36   ` Alexey Gladkov
  2017-02-22 20:11   ` Richard Weinberger
  1 sibling, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-02-20 10:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Oleg Nesterov, Pavel Emelyanov, Andy Lutomirski,
	Dmitry V. Levin

On Mon, Feb 20, 2017 at 05:05:18PM +1300, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > The pidfs filesystem contains a subset of the /proc file system which
> > contains only information about the processes.
> 
> My summary of your motivation.
> 
> It hurts when I create a container with a processes with uid 0 inside of
> it.  This generates lots of hacks to attempt to limit uid 0.
> 
>   My answer:  Don't run a container with a real uid 0 inside of it.
> 
> Any reasonable permission check will on proc files
> will keep you safe if your container does not have a real uid 0 in it.

Right now this approach is all we have for not virtualized objects in /proc.
We can only hope that those who create files in /proc are intelligent enough.

We can't protect ourselves from the appearance of something unwanted in /proc.

> That said I am not opposed in principle to a pidfs.  And the idea of
> using overlay for this purpose is intriguing.  I have not looked at it
> in enough detail comment on the technical merits.

I'm currently researching the way of adding support overlayfs for /proc.

But before going any further, I would like to discuss in the right direction
I think or not.

> > Some of the container virtualization systems are mounted /proc inside
> > the container. This is done in most cases to operate with information
> > about the processes. Knowing that /proc filesystem is not fully
> > virtualized they are mounted on top of dangerous places empty files or
> > directories (for exmaple /proc/kcore, /sys/firmware, etc.).
> >
> > The structure of this filesystem is dynamic and any module can create a
> > new object which will not necessarily be virtualized. There are
> > proprietary modules that aren't in the mainline whose work we can not
> > verify.
> >
> > This opens up a potential threat to the system. The developers of the
> > virtualization system can't predict all dangerous places in /proc by
> > definition.
> >
> > A more effective solution would be to mount into the container only what
> > is necessary and ignore the rest.
> >
> > Right now there is the opportunity to pass in the container any port of
> > the /proc filesystem using mount --bind expect the pids.
> >
> > This patch allows to mount only the part of /proc related to pids
> > without rest objects. Since this is an addon to /proc, flags applied to
> > /proc have an effect on this pidfs filesystem.
> >
> > Why not implement it as another flag to /proc ?
> >
> > The /proc flags is stored in the pid_namespace and are global for
> > namespace. It means that if you add a flag to hide all except the pids,
> > then it will act on all mounted instances of /proc.
> >
> > Originally the idea was that the container will be mounted only pidfs
> > and additional required files will be mounted on top using the
> > overlayfs. But I found out that /proc does not support overlayfs and
> > does not allow to mount anything on top or under it.
> >
> > My question is whether it's possible to add overlayfs support for /proc?
> >
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> > ---
> >  Documentation/filesystems/pidfs.txt | 16 ++++++++
> >  fs/proc/Kconfig                     |  8 ++++
> >  fs/proc/inode.c                     |  8 +++-
> >  fs/proc/internal.h                  |  2 +
> >  fs/proc/root.c                      | 76 ++++++++++++++++++++++++++++++++++---
> >  fs/proc/self.c                      |  6 +++
> >  fs/proc/thread_self.c               |  6 +++
> >  include/linux/pid_namespace.h       |  5 +++
> >  8 files changed, 119 insertions(+), 8 deletions(-)
> >  create mode 100644 Documentation/filesystems/pidfs.txt
> >
> > diff --git a/Documentation/filesystems/pidfs.txt b/Documentation/filesystems/pidfs.txt
> > new file mode 100644
> > index 0000000..ce958a5
> > --- /dev/null
> > +++ b/Documentation/filesystems/pidfs.txt
> > @@ -0,0 +1,16 @@
> > +The PIDFS Filesystem
> > +====================
> > +
> > +The pidfs filesystem contains a subset of the /proc file system which contains
> > +only information about the processes. The link self points to the process
> > +reading the file system. All other special files and directories in /proc are
> > +not available in this filesystem.
> > +
> > +The pidfs is not an independent filesystem, its implementation shares code
> > +with /proc.
> > +
> > +All mount options applicable to /proc filesystem are also applicable
> > +to pidfs filesystem. For example, access to the information in /proc/[pid]
> > +directories can be restricted using hidepid option.
> > +
> > +To get more information about the processes read the proc.txt
> > diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> > index 1ade120..fa568f6 100644
> > --- a/fs/proc/Kconfig
> > +++ b/fs/proc/Kconfig
> > @@ -43,6 +43,14 @@ config PROC_VMCORE
> >          help
> >          Exports the dump image of crashed kernel in ELF format.
> >  
> > +config PROC_PIDFS
> > +	bool "pidfs file system support"
> > +	depends on PROC_FS
> > +	default n
> > +	help
> > +	  The pidfs filesystem contains a subset of the /proc file system
> > +	  which contains only information only about the processes.
> > +
> >  config PROC_SYSCTL
> >  	bool "Sysctl support (/proc/sys)" if EXPERT
> >  	depends on PROC_FS
> > diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> > index 783bc19..1be65b4 100644
> > --- a/fs/proc/inode.c
> > +++ b/fs/proc/inode.c
> > @@ -474,12 +474,16 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
> >  int proc_fill_super(struct super_block *s, void *data, int silent)
> >  {
> >  	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
> > +	struct proc_dir_entry *fs_root = &proc_root;
> >  	struct inode *root_inode;
> >  	int ret;
> >  
> >  	if (!proc_parse_options(data, ns))
> >  		return -EINVAL;
> >  
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type)
> > +		fs_root = &pidfs_root;
> > +
> >  	/* User space would break if executables or devices appear on proc */
> >  	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
> >  	s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
> > @@ -496,8 +500,8 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
> >  	 */
> >  	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> >  	
> > -	pde_get(&proc_root);
> > -	root_inode = proc_get_inode(s, &proc_root);
> > +	pde_get(fs_root);
> > +	root_inode = proc_get_inode(s, fs_root);
> >  	if (!root_inode) {
> >  		pr_err("proc_fill_super: get root inode failed\n");
> >  		return -ENOMEM;
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 2de5194..a7c068c 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -267,6 +267,8 @@ static inline void proc_tty_init(void) {}
> >  /*
> >   * root.c
> >   */
> > +extern struct file_system_type pidfs_fs_type;
> > +extern struct proc_dir_entry pidfs_root;
> >  extern struct proc_dir_entry proc_root;
> >  extern int proc_parse_options(char *options, struct pid_namespace *pid);
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index 4bd0373..de16ac1 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -102,10 +102,21 @@ static void proc_kill_sb(struct super_block *sb)
> >  	struct pid_namespace *ns;
> >  
> >  	ns = (struct pid_namespace *)sb->s_fs_info;
> > -	if (ns->proc_self)
> > -		dput(ns->proc_self);
> > -	if (ns->proc_thread_self)
> > -		dput(ns->proc_thread_self);
> > +
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
> > +		if (ns->pidfs_self)
> > +			dput(ns->pidfs_self);
> > +
> > +		if (ns->pidfs_thread_self)
> > +			dput(ns->pidfs_thread_self);
> > +	} else {
> > +		if (ns->proc_self)
> > +			dput(ns->proc_self);
> > +
> > +		if (ns->proc_thread_self)
> > +			dput(ns->proc_thread_self);
> > +	}
> > +
> >  	kill_anon_super(sb);
> >  	put_pid_ns(ns);
> >  }
> > @@ -117,6 +128,13 @@ static struct file_system_type proc_fs_type = {
> >  	.fs_flags	= FS_USERNS_MOUNT,
> >  };
> >  
> > +struct file_system_type pidfs_fs_type = {
> > +	.name		= "pidfs",
> > +	.mount		= proc_mount,
> > +	.kill_sb	= proc_kill_sb,
> > +	.fs_flags	= FS_USERNS_MOUNT,
> > +};
> > +
> >  void __init proc_root_init(void)
> >  {
> >  	int err;
> > @@ -127,6 +145,10 @@ void __init proc_root_init(void)
> >  	if (err)
> >  		return;
> >  
> > +	err = register_filesystem(&pidfs_fs_type);
> > +	if (err)
> > +		return;
> > +
> >  	proc_self_init();
> >  	proc_thread_self_init();
> >  	proc_symlink("mounts", NULL, "self/mounts");
> > @@ -148,8 +170,7 @@ void __init proc_root_init(void)
> >  	proc_sys_init();
> >  }
> >  
> > -static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat
> > -)
> > +static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> >  {
> >  	generic_fillattr(d_inode(dentry), stat);
> >  	stat->nlink = proc_root.nlink + nr_processes();
> > @@ -176,6 +197,14 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> >  	return proc_pid_readdir(file, ctx);
> >  }
> >  
> > +static int pidfs_root_readdir(struct file *file, struct dir_context *ctx)
> > +{
> > +	if (ctx->pos < FIRST_PROCESS_ENTRY)
> > +		ctx->pos = FIRST_PROCESS_ENTRY;
> > +
> > +	return proc_pid_readdir(file, ctx);
> > +}
> > +
> >  /*
> >   * The root /proc directory is special, as it has the
> >   * <pid> directories. Thus we don't use the generic
> > @@ -187,6 +216,12 @@ static const struct file_operations proc_root_operations = {
> >  	.llseek		= generic_file_llseek,
> >  };
> >  
> > +static const struct file_operations pidfs_root_operations = {
> > +	.read		 = generic_read_dir,
> > +	.iterate_shared	 = pidfs_root_readdir,
> > +	.llseek		= generic_file_llseek,
> > +};
> > +
> >  /*
> >   * proc root can do almost nothing..
> >   */
> > @@ -195,6 +230,11 @@ static const struct inode_operations proc_root_inode_operations = {
> >  	.getattr	= proc_root_getattr,
> >  };
> >  
> > +static const struct inode_operations pidfs_root_inode_operations = {
> > +	.lookup		= proc_pid_lookup,
> > +	.getattr	= proc_root_getattr,
> > +};
> > +
> >  /*
> >   * This is the root "inode" in the /proc tree..
> >   */
> > @@ -211,6 +251,19 @@ struct proc_dir_entry proc_root = {
> >  	.name		= "/proc",
> >  };
> >  
> > +struct proc_dir_entry pidfs_root = {
> > +	.low_ino	= PROC_ROOT_INO,
> > +	.namelen	= 6,
> > +	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
> > +	.nlink		= 2,
> > +	.count		= ATOMIC_INIT(1),
> > +	.proc_iops	= &pidfs_root_inode_operations,
> > +	.proc_fops	= &pidfs_root_operations,
> > +	.parent		= &pidfs_root,
> > +	.subdir		= RB_ROOT,
> > +	.name		= "/pidfs",
> > +};
> > +
> >  int pid_ns_prepare_proc(struct pid_namespace *ns)
> >  {
> >  	struct vfsmount *mnt;
> > @@ -220,10 +273,21 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
> >  		return PTR_ERR(mnt);
> >  
> >  	ns->proc_mnt = mnt;
> > +
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS)) {
> > +		mnt = kern_mount_data(&pidfs_fs_type, ns);
> > +		if (IS_ERR(mnt))
> > +			return PTR_ERR(mnt);
> > +
> > +		ns->pidfs_mnt = mnt;
> > +	}
> >  	return 0;
> >  }
> >  
> >  void pid_ns_release_proc(struct pid_namespace *ns)
> >  {
> >  	kern_unmount(ns->proc_mnt);
> > +
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS))
> > +		kern_unmount(ns->pidfs_mnt);
> >  }
> > diff --git a/fs/proc/self.c b/fs/proc/self.c
> > index 4024595..dea7e17 100644
> > --- a/fs/proc/self.c
> > +++ b/fs/proc/self.c
> > @@ -74,6 +74,12 @@ int proc_setup_self(struct super_block *s)
> >  		pr_err("proc_fill_super: can't allocate /proc/self\n");
> >  		return PTR_ERR(self);
> >  	}
> > +
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> > +		ns->pidfs_self = self;
> > +		return 0;
> > +	}
> > +
> >  	ns->proc_self = self;
> >  	return 0;
> >  }
> > diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> > index 595b90a97..274c618 100644
> > --- a/fs/proc/thread_self.c
> > +++ b/fs/proc/thread_self.c
> > @@ -76,6 +76,12 @@ int proc_setup_thread_self(struct super_block *s)
> >  		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
> >  		return PTR_ERR(thread_self);
> >  	}
> > +
> > +	if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> > +		ns->pidfs_thread_self = thread_self;
> > +		return 0;
> > +	}
> > +
> >  	ns->proc_thread_self = thread_self;
> >  	return 0;
> >  }
> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> > index 34cce96..fca3a76 100644
> > --- a/include/linux/pid_namespace.h
> > +++ b/include/linux/pid_namespace.h
> > @@ -46,6 +46,11 @@ struct pid_namespace {
> >  	int hide_pid;
> >  	int reboot;	/* group exit code if this pidns was rebooted */
> >  	struct ns_common ns;
> > +#ifdef CONFIG_PROC_PIDFS
> > +	struct vfsmount *pidfs_mnt;
> > +	struct dentry *pidfs_self;
> > +	struct dentry *pidfs_thread_self;
> > +#endif
> >  };
> >  
> >  extern struct pid_namespace init_pid_ns;
> > -- 
> > 2.10.2

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
                   ` (2 preceding siblings ...)
  2017-02-20  4:05 ` Eric W. Biederman
@ 2017-02-21 14:57 ` Oleg Nesterov
  2017-02-22  7:40   ` Pavel Emelyanov
                     ` (3 more replies)
  2017-02-27 18:56 ` [PATCH] Add pidfs filesystem Michael Kerrisk
  4 siblings, 4 replies; 53+ messages in thread
From: Oleg Nesterov @ 2017-02-21 14:57 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Eric W. Biederman, Pavel Emelyanov

On 02/18, Alexey Gladkov wrote:
>
> This patch allows to mount only the part of /proc related to pids
> without rest objects. Since this is an addon to /proc, flags applied to
> /proc have an effect on this pidfs filesystem.

I leave this to you and Eric, but imo it would be nice to avoid another
filesystem.

> Why not implement it as another flag to /proc ?
>
> The /proc flags is stored in the pid_namespace and are global for
> namespace. It means that if you add a flag to hide all except the pids,
> then it will act on all mounted instances of /proc.

But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
the simple patch below. Not sure it is correct/complete, just to illustrate
the idea.

With this patch you can mount proc with -onodev and it will only show
pids/self/thread_self:

	# mkdir /tmp/D
	# mount -t proc -o nodev none /tmp/D
	# ls /tmp/D
	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
	# cat /tmp/D/meminfo
	cat: /tmp/D/meminfo: No such file or directory
	# ls /tmp/D/irq
	ls: cannot open directory /tmp/D/irq: No such file or directory

No?

Oleg.


--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -305,11 +305,22 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
 
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
+	int mnt_flags = file->f_path.mnt->mnt_flags;
 	struct inode *inode = file_inode(file);
 
+	if (mnt_flags & MNT_NODEV)
+		return 1;
+
 	return proc_readdir_de(PDE(inode), file, ctx);
 }
 
+static int proc_dir_open(struct inode *inode, struct file *file)
+{
+	if (file->f_path.mnt->mnt_flags & MNT_NODEV)
+		return -ENOENT;
+	return 0;
+}
+
 /*
  * These are the generic /proc directory operations. They
  * use the in-memory "struct proc_dir_entry" tree to parse
@@ -319,6 +330,7 @@ static const struct file_operations proc_dir_operations = {
 	.llseek			= generic_file_llseek,
 	.read			= generic_read_dir,
 	.iterate_shared		= proc_readdir,
+	.open			= proc_dir_open,
 };
 
 /*
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -318,12 +318,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
 
 static int proc_reg_open(struct inode *inode, struct file *file)
 {
+	int mnt_flags = file->f_path.mnt->mnt_flags;
 	struct proc_dir_entry *pde = PDE(inode);
 	int rv = 0;
 	int (*open)(struct inode *, struct file *);
 	int (*release)(struct inode *, struct file *);
 	struct pde_opener *pdeo;
 
+	if (mnt_flags & MNT_NODEV)
+		return -ENOENT;
+
 	/*
 	 * Ensure that
 	 * 1) PDE's ->release hook will be called no matter what

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-21 14:57 ` Oleg Nesterov
@ 2017-02-22  7:40   ` Pavel Emelyanov
  2017-02-22 12:04     ` Alexey Gladkov
  2017-02-22 11:53   ` Alexey Gladkov
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Pavel Emelyanov @ 2017-02-22  7:40 UTC (permalink / raw)
  To: Oleg Nesterov, Alexey Gladkov
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Eric W. Biederman

On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
> On 02/18, Alexey Gladkov wrote:
>>
>> This patch allows to mount only the part of /proc related to pids
>> without rest objects. Since this is an addon to /proc, flags applied to
>> /proc have an effect on this pidfs filesystem.
> 
> I leave this to you and Eric, but imo it would be nice to avoid another
> filesystem.
> 
>> Why not implement it as another flag to /proc ?
>>
>> The /proc flags is stored in the pid_namespace and are global for
>> namespace. It means that if you add a flag to hide all except the pids,
>> then it will act on all mounted instances of /proc.
> 
> But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> the simple patch below. Not sure it is correct/complete, just to illustrate
> the idea.
> 
> With this patch you can mount proc with -onodev and it will only show
> pids/self/thread_self:
> 
> 	# mkdir /tmp/D
> 	# mount -t proc -o nodev none /tmp/D
> 	# ls /tmp/D
> 	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
> 	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
> 	# cat /tmp/D/meminfo
> 	cat: /tmp/D/meminfo: No such file or directory
> 	# ls /tmp/D/irq
> 	ls: cannot open directory /tmp/D/irq: No such file or directory
> 
> No?

Yes!!! If this whole effort with pidfs and overlayfs will move forward, I would
prefer seeing the nodev procfs version, rather than another fs.

As far as the overlayfs part is concerned, having an overlayfs mounted on /proc
inside container may result in problems as applications sometimes check for /proc
containing procfs (by checking statfs.f_type == PROC_SUPER_MAGIC or by reading
the /proc/mounts).

-- Pavel

> Oleg.
> 
> 
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -305,11 +305,22 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
>  
>  int proc_readdir(struct file *file, struct dir_context *ctx)
>  {
> +	int mnt_flags = file->f_path.mnt->mnt_flags;
>  	struct inode *inode = file_inode(file);
>  
> +	if (mnt_flags & MNT_NODEV)
> +		return 1;
> +
>  	return proc_readdir_de(PDE(inode), file, ctx);
>  }
>  
> +static int proc_dir_open(struct inode *inode, struct file *file)
> +{
> +	if (file->f_path.mnt->mnt_flags & MNT_NODEV)
> +		return -ENOENT;
> +	return 0;
> +}
> +
>  /*
>   * These are the generic /proc directory operations. They
>   * use the in-memory "struct proc_dir_entry" tree to parse
> @@ -319,6 +330,7 @@ static const struct file_operations proc_dir_operations = {
>  	.llseek			= generic_file_llseek,
>  	.read			= generic_read_dir,
>  	.iterate_shared		= proc_readdir,
> +	.open			= proc_dir_open,
>  };
>  
>  /*
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -318,12 +318,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
>  
>  static int proc_reg_open(struct inode *inode, struct file *file)
>  {
> +	int mnt_flags = file->f_path.mnt->mnt_flags;
>  	struct proc_dir_entry *pde = PDE(inode);
>  	int rv = 0;
>  	int (*open)(struct inode *, struct file *);
>  	int (*release)(struct inode *, struct file *);
>  	struct pde_opener *pdeo;
>  
> +	if (mnt_flags & MNT_NODEV)
> +		return -ENOENT;
> +
>  	/*
>  	 * Ensure that
>  	 * 1) PDE's ->release hook will be called no matter what
> 
> .
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-21 14:57 ` Oleg Nesterov
  2017-02-22  7:40   ` Pavel Emelyanov
@ 2017-02-22 11:53   ` Alexey Gladkov
  2017-02-22 15:37   ` Dmitry V. Levin
  2017-03-06 23:05     ` Alexey Gladkov
  3 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-02-22 11:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Eric W. Biederman, Pavel Emelyanov

On Tue, Feb 21, 2017 at 03:57:47PM +0100, Oleg Nesterov wrote:
> On 02/18, Alexey Gladkov wrote:
> >
> > This patch allows to mount only the part of /proc related to pids
> > without rest objects. Since this is an addon to /proc, flags applied to
> > /proc have an effect on this pidfs filesystem.
> 
> I leave this to you and Eric, but imo it would be nice to avoid another
> filesystem.
> 
> > Why not implement it as another flag to /proc ?
> >
> > The /proc flags is stored in the pid_namespace and are global for
> > namespace. It means that if you add a flag to hide all except the pids,
> > then it will act on all mounted instances of /proc.
> 
> But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> the simple patch below. Not sure it is correct/complete, just to illustrate
> the idea.
> 
> With this patch you can mount proc with -onodev and it will only show
> pids/self/thread_self:
> 
> 	# mkdir /tmp/D
> 	# mount -t proc -o nodev none /tmp/D
> 	# ls /tmp/D
> 	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
> 	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
> 	# cat /tmp/D/meminfo
> 	cat: /tmp/D/meminfo: No such file or directory
> 	# ls /tmp/D/irq
> 	ls: cannot open directory /tmp/D/irq: No such file or directory
> 
> No?

I'm embarrassed that we start the change procfs by abuse something. It is very
difficult to explain why this option leads to such consequences on procfs.

Also with this change it won't be the same procfs. We can't to name it as procfs
because procfs have cpuinfo, meminfo, etc.

What do you think about this ?

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-22  7:40   ` Pavel Emelyanov
@ 2017-02-22 12:04     ` Alexey Gladkov
  2017-02-22 13:08       ` Pavel Emelyanov
  0 siblings, 1 reply; 53+ messages in thread
From: Alexey Gladkov @ 2017-02-22 12:04 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Dmitry V. Levin

On Wed, Feb 22, 2017 at 10:40:49AM +0300, Pavel Emelyanov wrote:
> On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
> > On 02/18, Alexey Gladkov wrote:
> >>
> >> This patch allows to mount only the part of /proc related to pids
> >> without rest objects. Since this is an addon to /proc, flags applied to
> >> /proc have an effect on this pidfs filesystem.
> > 
> > I leave this to you and Eric, but imo it would be nice to avoid another
> > filesystem.
> > 
> >> Why not implement it as another flag to /proc ?
> >>
> >> The /proc flags is stored in the pid_namespace and are global for
> >> namespace. It means that if you add a flag to hide all except the pids,
> >> then it will act on all mounted instances of /proc.
> > 
> > But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> > the simple patch below. Not sure it is correct/complete, just to illustrate
> > the idea.
> > 
> > With this patch you can mount proc with -onodev and it will only show
> > pids/self/thread_self:
> > 
> > 	# mkdir /tmp/D
> > 	# mount -t proc -o nodev none /tmp/D
> > 	# ls /tmp/D
> > 	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
> > 	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
> > 	# cat /tmp/D/meminfo
> > 	cat: /tmp/D/meminfo: No such file or directory
> > 	# ls /tmp/D/irq
> > 	ls: cannot open directory /tmp/D/irq: No such file or directory
> > 
> > No?
> 
> Yes!!! If this whole effort with pidfs and overlayfs will move forward, I would
> prefer seeing the nodev procfs version, rather than another fs.

But this is not procfs anymore. If someone will wait for procfs here it will
be disappointed :)

> As far as the overlayfs part is concerned, having an overlayfs mounted on /proc
> inside container may result in problems as applications sometimes check for /proc
> containing procfs (by checking statfs.f_type == PROC_SUPER_MAGIC or by reading
> the /proc/mounts).

It is not a replacement for procfs. It's a subset of procfs. If someone wants
the procfs in the code we should not deceive him.

No?

> -- Pavel
> 
> > Oleg.
> > 
> > 
> > --- a/fs/proc/generic.c
> > +++ b/fs/proc/generic.c
> > @@ -305,11 +305,22 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
> >  
> >  int proc_readdir(struct file *file, struct dir_context *ctx)
> >  {
> > +	int mnt_flags = file->f_path.mnt->mnt_flags;
> >  	struct inode *inode = file_inode(file);
> >  
> > +	if (mnt_flags & MNT_NODEV)
> > +		return 1;
> > +
> >  	return proc_readdir_de(PDE(inode), file, ctx);
> >  }
> >  
> > +static int proc_dir_open(struct inode *inode, struct file *file)
> > +{
> > +	if (file->f_path.mnt->mnt_flags & MNT_NODEV)
> > +		return -ENOENT;
> > +	return 0;
> > +}
> > +
> >  /*
> >   * These are the generic /proc directory operations. They
> >   * use the in-memory "struct proc_dir_entry" tree to parse
> > @@ -319,6 +330,7 @@ static const struct file_operations proc_dir_operations = {
> >  	.llseek			= generic_file_llseek,
> >  	.read			= generic_read_dir,
> >  	.iterate_shared		= proc_readdir,
> > +	.open			= proc_dir_open,
> >  };
> >  
> >  /*
> > --- a/fs/proc/inode.c
> > +++ b/fs/proc/inode.c
> > @@ -318,12 +318,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
> >  
> >  static int proc_reg_open(struct inode *inode, struct file *file)
> >  {
> > +	int mnt_flags = file->f_path.mnt->mnt_flags;
> >  	struct proc_dir_entry *pde = PDE(inode);
> >  	int rv = 0;
> >  	int (*open)(struct inode *, struct file *);
> >  	int (*release)(struct inode *, struct file *);
> >  	struct pde_opener *pdeo;
> >  
> > +	if (mnt_flags & MNT_NODEV)
> > +		return -ENOENT;
> > +
> >  	/*
> >  	 * Ensure that
> >  	 * 1) PDE's ->release hook will be called no matter what
> > 
> > .
> > 

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-22 12:04     ` Alexey Gladkov
@ 2017-02-22 13:08       ` Pavel Emelyanov
  0 siblings, 0 replies; 53+ messages in thread
From: Pavel Emelyanov @ 2017-02-22 13:08 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Oleg Nesterov, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Dmitry V. Levin

On 02/22/2017 03:04 PM, Alexey Gladkov wrote:
> On Wed, Feb 22, 2017 at 10:40:49AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 05:57 PM, Oleg Nesterov wrote:
>>> On 02/18, Alexey Gladkov wrote:
>>>>
>>>> This patch allows to mount only the part of /proc related to pids
>>>> without rest objects. Since this is an addon to /proc, flags applied to
>>>> /proc have an effect on this pidfs filesystem.
>>>
>>> I leave this to you and Eric, but imo it would be nice to avoid another
>>> filesystem.
>>>
>>>> Why not implement it as another flag to /proc ?
>>>>
>>>> The /proc flags is stored in the pid_namespace and are global for
>>>> namespace. It means that if you add a flag to hide all except the pids,
>>>> then it will act on all mounted instances of /proc.
>>>
>>> But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
>>> the simple patch below. Not sure it is correct/complete, just to illustrate
>>> the idea.
>>>
>>> With this patch you can mount proc with -onodev and it will only show
>>> pids/self/thread_self:
>>>
>>> 	# mkdir /tmp/D
>>> 	# mount -t proc -o nodev none /tmp/D
>>> 	# ls /tmp/D
>>> 	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
>>> 	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
>>> 	# cat /tmp/D/meminfo
>>> 	cat: /tmp/D/meminfo: No such file or directory
>>> 	# ls /tmp/D/irq
>>> 	ls: cannot open directory /tmp/D/irq: No such file or directory
>>>
>>> No?
>>
>> Yes!!! If this whole effort with pidfs and overlayfs will move forward, I would
>> prefer seeing the nodev procfs version, rather than another fs.
> 
> But this is not procfs anymore. If someone will wait for procfs here it will
> be disappointed :)

Well, it depends on what files he's looking for in there. This is what overlay
part should come for.

>> As far as the overlayfs part is concerned, having an overlayfs mounted on /proc
>> inside container may result in problems as applications sometimes check for /proc
>> containing procfs (by checking statfs.f_type == PROC_SUPER_MAGIC or by reading
>> the /proc/mounts).
> 
> It is not a replacement for procfs. It's a subset of procfs. If someone wants
> the procfs in the code we should not deceive him.
> 
> No?

But this is what we actually do -- Docker does with bind-mounts, LXC does with lxcfs,
OpenVZ does with kernel patches. Every time a container starts the regular /proc is
mutated not to show some information.

-- Pavel

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-21 14:57 ` Oleg Nesterov
  2017-02-22  7:40   ` Pavel Emelyanov
  2017-02-22 11:53   ` Alexey Gladkov
@ 2017-02-22 15:37   ` Dmitry V. Levin
  2017-02-22 17:48     ` Oleg Nesterov
  2017-03-06 23:05     ` Alexey Gladkov
  3 siblings, 1 reply; 53+ messages in thread
From: Dmitry V. Levin @ 2017-02-22 15:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 2054 bytes --]

On Tue, Feb 21, 2017 at 03:57:47PM +0100, Oleg Nesterov wrote:
> On 02/18, Alexey Gladkov wrote:
> >
> > This patch allows to mount only the part of /proc related to pids
> > without rest objects. Since this is an addon to /proc, flags applied to
> > /proc have an effect on this pidfs filesystem.
> 
> I leave this to you and Eric, but imo it would be nice to avoid another
> filesystem.
> 
> > Why not implement it as another flag to /proc ?
> >
> > The /proc flags is stored in the pid_namespace and are global for
> > namespace. It means that if you add a flag to hide all except the pids,
> > then it will act on all mounted instances of /proc.
> 
> But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> the simple patch below. Not sure it is correct/complete, just to illustrate
> the idea.
> 
> With this patch you can mount proc with -onodev and it will only show
> pids/self/thread_self:
> 
> 	# mkdir /tmp/D
> 	# mount -t proc -o nodev none /tmp/D
> 	# ls /tmp/D
> 	1   11	13  15	17  19	20  22	24  28	3   31	33  4  56  7  9     thread-self
> 	10  12	14  16	18  2	21  23	27  29	30  32	34  5  6   8  self
> 	# cat /tmp/D/meminfo
> 	cat: /tmp/D/meminfo: No such file or directory
> 	# ls /tmp/D/irq
> 	ls: cannot open directory /tmp/D/irq: No such file or directory
> 
> No?

I like the idea of using mnt_flags to turn procfs into pidfs, thus
avoiding yet another filesystem, but MNT_NODEV has a different meaning,
namely "do not interpret character or block special devices on the file
system".  I've actually found a system nearby that already mounts /proc
with nodev:

# grep 'proc.*nodev' /proc/mounts
proc /var/lib/vz/root/1002/proc proc rw,nosuid,nodev,noexec,relatime,gid=19,hidepid=2 0 0
proc /var/lib/vz/root/1003/proc proc rw,nosuid,nodev,noexec,relatime,gid=19,hidepid=2 0 0
proc /var/lib/vz/root/1004/proc proc rw,nosuid,nodev,noexec,relatime,gid=19,hidepid=2 0 0
proc /var/lib/vz/root/1005/proc proc rw,nosuid,nodev,noexec,relatime,gid=19,hidepid=2 0 0


-- 
ldv

[-- Attachment #2: Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-22 15:37   ` Dmitry V. Levin
@ 2017-02-22 17:48     ` Oleg Nesterov
  2017-02-22 19:56       ` Alexey Gladkov
  0 siblings, 1 reply; 53+ messages in thread
From: Oleg Nesterov @ 2017-02-22 17:48 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov

On 02/22, Dmitry V. Levin wrote:
>
> On Tue, Feb 21, 2017 at 03:57:47PM +0100, Oleg Nesterov wrote:
> >
> > But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> > the simple patch below. Not sure it is correct/complete, just to illustrate
> > the idea.

...

> I like the idea of using mnt_flags to turn procfs into pidfs, thus
> avoiding yet another filesystem, but MNT_NODEV has a different meaning,

Yes, yes, see "abuse MNT_NODEV" above. The patch just tries to illustrate
the idea.

Oleg.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-22 17:48     ` Oleg Nesterov
@ 2017-02-22 19:56       ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-02-22 19:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dmitry V. Levin, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov

On Wed, Feb 22, 2017 at 06:48:03PM +0100, Oleg Nesterov wrote:
> On 02/22, Dmitry V. Levin wrote:
> >
> > On Tue, Feb 21, 2017 at 03:57:47PM +0100, Oleg Nesterov wrote:
> > >
> > > But perhaps we can use mnt_flags? For example, lets abuse MNT_NODEV, see
> > > the simple patch below. Not sure it is correct/complete, just to illustrate
> > > the idea.
> 
> ...
> 
> > I like the idea of using mnt_flags to turn procfs into pidfs, thus
> > avoiding yet another filesystem, but MNT_NODEV has a different meaning,
> 
> Yes, yes, see "abuse MNT_NODEV" above. The patch just tries to illustrate
> the idea.

I think I can try to use subtype for this. In this case, it is still the same
procfs, but with the subtype:

mount -t proc.pidfs ...

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-20  4:05 ` Eric W. Biederman
  2017-02-20 10:36   ` Alexey Gladkov
@ 2017-02-22 20:11   ` Richard Weinberger
  1 sibling, 0 replies; 53+ messages in thread
From: Richard Weinberger @ 2017-02-22 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Oleg Nesterov, Pavel Emelyanov,
	Andy Lutomirski

On Mon, Feb 20, 2017 at 5:05 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>
>> The pidfs filesystem contains a subset of the /proc file system which
>> contains only information about the processes.
>
> My summary of your motivation.
>
> It hurts when I create a container with a processes with uid 0 inside of
> it.  This generates lots of hacks to attempt to limit uid 0.
>
>   My answer:  Don't run a container with a real uid 0 inside of it.

I agree. Unless I miss something I'd say use a user namespace
to get decent permission checks in /proc (and /sys).

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] Add pidfs filesystem
  2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
                   ` (3 preceding siblings ...)
  2017-02-21 14:57 ` Oleg Nesterov
@ 2017-02-27 18:56 ` Michael Kerrisk
  4 siblings, 0 replies; 53+ messages in thread
From: Michael Kerrisk @ 2017-02-27 18:56 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Vasiliy Kulikov,
	Al Viro, Eric W. Biederman, Oleg Nesterov, Pavel Emelyanov,
	Linux API

[CC += linux-api@vger.kernel.org]

Hi Alexey

This is a change to the kernel-user-space API. Please CC
linux-api@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael


On Sat, Feb 18, 2017 at 11:53 PM, Alexey Gladkov
<gladkov.alexey@gmail.com> wrote:
> The pidfs filesystem contains a subset of the /proc file system which
> contains only information about the processes.
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids
> without rest objects. Since this is an addon to /proc, flags applied to
> /proc have an effect on this pidfs filesystem.
>
> Why not implement it as another flag to /proc ?
>
> The /proc flags is stored in the pid_namespace and are global for
> namespace. It means that if you add a flag to hide all except the pids,
> then it will act on all mounted instances of /proc.
>
> Originally the idea was that the container will be mounted only pidfs
> and additional required files will be mounted on top using the
> overlayfs. But I found out that /proc does not support overlayfs and
> does not allow to mount anything on top or under it.
>
> My question is whether it's possible to add overlayfs support for /proc?
>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---
>  Documentation/filesystems/pidfs.txt | 16 ++++++++
>  fs/proc/Kconfig                     |  8 ++++
>  fs/proc/inode.c                     |  8 +++-
>  fs/proc/internal.h                  |  2 +
>  fs/proc/root.c                      | 76 ++++++++++++++++++++++++++++++++++---
>  fs/proc/self.c                      |  6 +++
>  fs/proc/thread_self.c               |  6 +++
>  include/linux/pid_namespace.h       |  5 +++
>  8 files changed, 119 insertions(+), 8 deletions(-)
>  create mode 100644 Documentation/filesystems/pidfs.txt
>
> diff --git a/Documentation/filesystems/pidfs.txt b/Documentation/filesystems/pidfs.txt
> new file mode 100644
> index 0000000..ce958a5
> --- /dev/null
> +++ b/Documentation/filesystems/pidfs.txt
> @@ -0,0 +1,16 @@
> +The PIDFS Filesystem
> +====================
> +
> +The pidfs filesystem contains a subset of the /proc file system which contains
> +only information about the processes. The link self points to the process
> +reading the file system. All other special files and directories in /proc are
> +not available in this filesystem.
> +
> +The pidfs is not an independent filesystem, its implementation shares code
> +with /proc.
> +
> +All mount options applicable to /proc filesystem are also applicable
> +to pidfs filesystem. For example, access to the information in /proc/[pid]
> +directories can be restricted using hidepid option.
> +
> +To get more information about the processes read the proc.txt
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 1ade120..fa568f6 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -43,6 +43,14 @@ config PROC_VMCORE
>          help
>          Exports the dump image of crashed kernel in ELF format.
>
> +config PROC_PIDFS
> +       bool "pidfs file system support"
> +       depends on PROC_FS
> +       default n
> +       help
> +         The pidfs filesystem contains a subset of the /proc file system
> +         which contains only information only about the processes.
> +
>  config PROC_SYSCTL
>         bool "Sysctl support (/proc/sys)" if EXPERT
>         depends on PROC_FS
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 783bc19..1be65b4 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -474,12 +474,16 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
>  int proc_fill_super(struct super_block *s, void *data, int silent)
>  {
>         struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
> +       struct proc_dir_entry *fs_root = &proc_root;
>         struct inode *root_inode;
>         int ret;
>
>         if (!proc_parse_options(data, ns))
>                 return -EINVAL;
>
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type)
> +               fs_root = &pidfs_root;
> +
>         /* User space would break if executables or devices appear on proc */
>         s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
>         s->s_flags |= MS_NODIRATIME | MS_NOSUID | MS_NOEXEC;
> @@ -496,8 +500,8 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
>          */
>         s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
>
> -       pde_get(&proc_root);
> -       root_inode = proc_get_inode(s, &proc_root);
> +       pde_get(fs_root);
> +       root_inode = proc_get_inode(s, fs_root);
>         if (!root_inode) {
>                 pr_err("proc_fill_super: get root inode failed\n");
>                 return -ENOMEM;
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 2de5194..a7c068c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -267,6 +267,8 @@ static inline void proc_tty_init(void) {}
>  /*
>   * root.c
>   */
> +extern struct file_system_type pidfs_fs_type;
> +extern struct proc_dir_entry pidfs_root;
>  extern struct proc_dir_entry proc_root;
>  extern int proc_parse_options(char *options, struct pid_namespace *pid);
>
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 4bd0373..de16ac1 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -102,10 +102,21 @@ static void proc_kill_sb(struct super_block *sb)
>         struct pid_namespace *ns;
>
>         ns = (struct pid_namespace *)sb->s_fs_info;
> -       if (ns->proc_self)
> -               dput(ns->proc_self);
> -       if (ns->proc_thread_self)
> -               dput(ns->proc_thread_self);
> +
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS) && sb->s_type == &pidfs_fs_type) {
> +               if (ns->pidfs_self)
> +                       dput(ns->pidfs_self);
> +
> +               if (ns->pidfs_thread_self)
> +                       dput(ns->pidfs_thread_self);
> +       } else {
> +               if (ns->proc_self)
> +                       dput(ns->proc_self);
> +
> +               if (ns->proc_thread_self)
> +                       dput(ns->proc_thread_self);
> +       }
> +
>         kill_anon_super(sb);
>         put_pid_ns(ns);
>  }
> @@ -117,6 +128,13 @@ static struct file_system_type proc_fs_type = {
>         .fs_flags       = FS_USERNS_MOUNT,
>  };
>
> +struct file_system_type pidfs_fs_type = {
> +       .name           = "pidfs",
> +       .mount          = proc_mount,
> +       .kill_sb        = proc_kill_sb,
> +       .fs_flags       = FS_USERNS_MOUNT,
> +};
> +
>  void __init proc_root_init(void)
>  {
>         int err;
> @@ -127,6 +145,10 @@ void __init proc_root_init(void)
>         if (err)
>                 return;
>
> +       err = register_filesystem(&pidfs_fs_type);
> +       if (err)
> +               return;
> +
>         proc_self_init();
>         proc_thread_self_init();
>         proc_symlink("mounts", NULL, "self/mounts");
> @@ -148,8 +170,7 @@ void __init proc_root_init(void)
>         proc_sys_init();
>  }
>
> -static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat
> -)
> +static int proc_root_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
>  {
>         generic_fillattr(d_inode(dentry), stat);
>         stat->nlink = proc_root.nlink + nr_processes();
> @@ -176,6 +197,14 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
>         return proc_pid_readdir(file, ctx);
>  }
>
> +static int pidfs_root_readdir(struct file *file, struct dir_context *ctx)
> +{
> +       if (ctx->pos < FIRST_PROCESS_ENTRY)
> +               ctx->pos = FIRST_PROCESS_ENTRY;
> +
> +       return proc_pid_readdir(file, ctx);
> +}
> +
>  /*
>   * The root /proc directory is special, as it has the
>   * <pid> directories. Thus we don't use the generic
> @@ -187,6 +216,12 @@ static const struct file_operations proc_root_operations = {
>         .llseek         = generic_file_llseek,
>  };
>
> +static const struct file_operations pidfs_root_operations = {
> +       .read            = generic_read_dir,
> +       .iterate_shared  = pidfs_root_readdir,
> +       .llseek         = generic_file_llseek,
> +};
> +
>  /*
>   * proc root can do almost nothing..
>   */
> @@ -195,6 +230,11 @@ static const struct inode_operations proc_root_inode_operations = {
>         .getattr        = proc_root_getattr,
>  };
>
> +static const struct inode_operations pidfs_root_inode_operations = {
> +       .lookup         = proc_pid_lookup,
> +       .getattr        = proc_root_getattr,
> +};
> +
>  /*
>   * This is the root "inode" in the /proc tree..
>   */
> @@ -211,6 +251,19 @@ struct proc_dir_entry proc_root = {
>         .name           = "/proc",
>  };
>
> +struct proc_dir_entry pidfs_root = {
> +       .low_ino        = PROC_ROOT_INO,
> +       .namelen        = 6,
> +       .mode           = S_IFDIR | S_IRUGO | S_IXUGO,
> +       .nlink          = 2,
> +       .count          = ATOMIC_INIT(1),
> +       .proc_iops      = &pidfs_root_inode_operations,
> +       .proc_fops      = &pidfs_root_operations,
> +       .parent         = &pidfs_root,
> +       .subdir         = RB_ROOT,
> +       .name           = "/pidfs",
> +};
> +
>  int pid_ns_prepare_proc(struct pid_namespace *ns)
>  {
>         struct vfsmount *mnt;
> @@ -220,10 +273,21 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
>                 return PTR_ERR(mnt);
>
>         ns->proc_mnt = mnt;
> +
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS)) {
> +               mnt = kern_mount_data(&pidfs_fs_type, ns);
> +               if (IS_ERR(mnt))
> +                       return PTR_ERR(mnt);
> +
> +               ns->pidfs_mnt = mnt;
> +       }
>         return 0;
>  }
>
>  void pid_ns_release_proc(struct pid_namespace *ns)
>  {
>         kern_unmount(ns->proc_mnt);
> +
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS))
> +               kern_unmount(ns->pidfs_mnt);
>  }
> diff --git a/fs/proc/self.c b/fs/proc/self.c
> index 4024595..dea7e17 100644
> --- a/fs/proc/self.c
> +++ b/fs/proc/self.c
> @@ -74,6 +74,12 @@ int proc_setup_self(struct super_block *s)
>                 pr_err("proc_fill_super: can't allocate /proc/self\n");
>                 return PTR_ERR(self);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> +               ns->pidfs_self = self;
> +               return 0;
> +       }
> +
>         ns->proc_self = self;
>         return 0;
>  }
> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index 595b90a97..274c618 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -76,6 +76,12 @@ int proc_setup_thread_self(struct super_block *s)
>                 pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
>                 return PTR_ERR(thread_self);
>         }
> +
> +       if (IS_ENABLED(CONFIG_PROC_PIDFS) && s->s_type == &pidfs_fs_type) {
> +               ns->pidfs_thread_self = thread_self;
> +               return 0;
> +       }
> +
>         ns->proc_thread_self = thread_self;
>         return 0;
>  }
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 34cce96..fca3a76 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -46,6 +46,11 @@ struct pid_namespace {
>         int hide_pid;
>         int reboot;     /* group exit code if this pidns was rebooted */
>         struct ns_common ns;
> +#ifdef CONFIG_PROC_PIDFS
> +       struct vfsmount *pidfs_mnt;
> +       struct dentry *pidfs_self;
> +       struct dentry *pidfs_thread_self;
> +#endif
>  };
>
>  extern struct pid_namespace init_pid_ns;
> --
> 2.10.2
>
>
> --
> Rgrds, legion
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC] Add option to mount only a pids subset
@ 2017-03-06 23:05     ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-06 23:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linux API, Kirill A. Shutemov, Vasiliy Kulikov, Al Viro,
	Eric W. Biederman, Oleg Nesterov, Pavel Emelyanov,
	James Bottomley


After discussion with Oleg Nesterov I reimplement my patch as an additional
option for /proc. This option affects the mountpoint. It means that in one
pid namespace it possible to have both the whole traditional /proc and
/proc with only pids subset.

However, it remains an open question about overlayfs support in the /proc.

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an option for /proc, flags applied to
/proc have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid sunset
and additional required files will be mounted on top using the overlayfs.
But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

 * Show pidonly via proc_show_options.
 * Add overlayfs support.

---
 fs/internal.h         |  1 +
 fs/namespace.c        |  9 ++++++++
 fs/proc/generic.c     |  4 ++++
 fs/proc/inode.c       |  7 +++++-
 fs/proc/internal.h    |  8 +++++++
 fs/proc/root.c        | 62 +++++++++++++++++++++++++++++++++++++++++++++++----
 fs/stat.c             |  4 ++++
 fs/super.c            | 20 +++++++++++++++++
 include/linux/fs.h    |  2 ++
 include/linux/mount.h |  1 +
 10 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 4fcf517..cb44ca7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void);
  * super.c
  */
 extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
 			       int, const char *, void *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..6cb2bcb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p)
 struct vfsmount *
 vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
 {
+	int err;
 	struct mount *mnt;
 	struct dentry *root;
 
@@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 	mnt->mnt.mnt_sb = root->d_sb;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
+
+	err = do_mount_sb(&mnt->mnt, flags, data);
+	if(err) {
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
 	unlock_mount_hash();
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7eb3cef..dd3da60 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
+	struct proc_options *opts = file->f_path.mnt->fs_data;
+
+	if (opts && opts->pid_only)
+		return 1;
 
 	return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..51b1712 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
 	if (pid->hide_pid != 0)
 		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
-
 	return 0;
 }
 
@@ -118,6 +117,8 @@ static const struct super_operations proc_sops = {
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= proc_evict_inode,
 	.statfs		= simple_statfs,
+	.getattr_fs	= proc_getattrfs,
+	.mount_fs	= proc_mount_cb,
 	.remount_fs	= proc_remount,
 	.show_options	= proc_show_options,
 };
@@ -323,6 +324,10 @@ static int proc_reg_open(struct inode *inode, struct file *file)
 	int (*open)(struct inode *, struct file *);
 	int (*release)(struct inode *, struct file *);
 	struct pde_opener *pdeo;
+	struct proc_options *opts = file->f_path.mnt->fs_data;
+
+	if (opts && opts->pid_only)
+		return -ENOENT;
 
 	/*
 	 * Ensure that
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194..7fdbfee 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -267,11 +267,19 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+struct proc_options {
+	kgid_t pid_gid;
+	int hide_pid;
+	int pid_only;
+};
+
 extern struct proc_dir_entry proc_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data);
+extern int proc_getattrfs(struct vfsmount *, struct dentry *, struct kstat *);
 
 /*
  * task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4bd0373..e07f37a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -14,6 +14,7 @@
 #include <linux/stat.h>
 #include <linux/init.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/user_namespace.h>
@@ -24,16 +25,17 @@
 #include "internal.h"
 
 enum {
-	Opt_gid, Opt_hidepid, Opt_err,
+	Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
 };
 
 static const match_table_t tokens = {
 	{Opt_hidepid, "hidepid=%u"},
 	{Opt_gid, "gid=%u"},
+	{Opt_pidonly, "pidonly"},
 	{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -53,7 +55,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return 0;
-			pid->pid_gid = make_kgid(current_user_ns(), option);
+			fs_opts->pid_gid = make_kgid(current_user_ns(), option);
 			break;
 		case Opt_hidepid:
 			if (match_int(&args[0], &option))
@@ -62,7 +64,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 				pr_err("proc: hidepid value must be between 0 and 2.\n");
 				return 0;
 			}
-			pid->hide_pid = option;
+			fs_opts->hide_pid = option;
+			break;
+		case Opt_pidonly:
+			fs_opts->pid_only = 1;
 			break;
 		default:
 			pr_err("proc: unrecognized mount option \"%s\" "
@@ -74,6 +79,19 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+	struct proc_options opts;
+
+	if (!proc_fill_options(options, &opts))
+		return 0;
+
+	pid->pid_gid = opts.pid_gid;
+	pid->hide_pid = opts.hide_pid;
+
+	return 1;
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
@@ -82,6 +100,42 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
 	return !proc_parse_options(data, pid);
 }
 
+int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	struct pid *pid = proc_pid(dentry->d_inode);
+	struct proc_options *opts = mnt->fs_data;
+
+	if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
+		return -ENOENT;
+
+	if (!inode->i_op->getattr) {
+		generic_fillattr(inode, stat);
+		return 0;
+	}
+
+	return inode->i_op->getattr(mnt, dentry, stat);
+}
+
+int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data)
+{
+	struct proc_options *opts;
+
+	if (!data || *flags & MS_KERNMOUNT)
+		return 0;
+
+	opts = kzalloc(sizeof(struct proc_options), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	if (!proc_fill_options(data, opts))
+		return -EINVAL;
+
+	mnt->fs_data = opts;
+
+	return 0;
+}
+
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
diff --git a/fs/stat.c b/fs/stat.c
index bc045c7..1e26308 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -10,6 +10,7 @@
 #include <linux/file.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -53,6 +54,9 @@ int vfs_getattr_nosec(struct path *path, struct kstat *stat)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
 
+	if (path->mnt->mnt_sb->s_op->getattr_fs)
+		return path->mnt->mnt_sb->s_op->getattr_fs(path->mnt, path->dentry, stat);
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path->mnt, path->dentry, stat);
 
diff --git a/fs/super.c b/fs/super.c
index c183835..478dd5b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -832,6 +832,26 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	return retval;
 }
 
+/**
+ *	do_mount_sb - asks filesystem to finish mount with options.
+ *	@mnt:	vfsmount in question
+ *	@flags:	numeric part of options
+ *	@data:	the rest of options
+ *
+ *	Alters the mount options of a mounted filesystem.
+ */
+int do_mount_sb(struct vfsmount *mnt, int flags, void *data)
+{
+	if (mnt->mnt_sb->s_writers.frozen != SB_UNFROZEN)
+		return -EBUSY;
+
+	if (mnt->mnt_sb->s_op->mount_fs) {
+		return mnt->mnt_sb->s_op->mount_fs(mnt, &flags, data);
+	}
+
+	return 0;
+}
+
 static void do_emergency_remount(struct work_struct *work)
 {
 	struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 83de8b6..5bd1b84 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1759,6 +1759,8 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
+	int (*mount_fs) (struct vfsmount *, int *, char *);
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..4bd364e 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -67,6 +67,7 @@ struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
 	int mnt_flags;
+	void *fs_data;			/* fs-specific data */
 };
 
 struct file; /* forward dec */
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [RFC] Add option to mount only a pids subset
@ 2017-03-06 23:05     ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-06 23:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linux API, Kirill A. Shutemov, Vasiliy Kulikov, Al Viro,
	Eric W. Biederman, Oleg Nesterov, Pavel Emelyanov,
	James Bottomley


After discussion with Oleg Nesterov I reimplement my patch as an additional
option for /proc. This option affects the mountpoint. It means that in one
pid namespace it possible to have both the whole traditional /proc and
/proc with only pids subset.

However, it remains an open question about overlayfs support in the /proc.

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an option for /proc, flags applied to
/proc have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid sunset
and additional required files will be mounted on top using the overlayfs.
But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

 * Show pidonly via proc_show_options.
 * Add overlayfs support.

---
 fs/internal.h         |  1 +
 fs/namespace.c        |  9 ++++++++
 fs/proc/generic.c     |  4 ++++
 fs/proc/inode.c       |  7 +++++-
 fs/proc/internal.h    |  8 +++++++
 fs/proc/root.c        | 62 +++++++++++++++++++++++++++++++++++++++++++++++----
 fs/stat.c             |  4 ++++
 fs/super.c            | 20 +++++++++++++++++
 include/linux/fs.h    |  2 ++
 include/linux/mount.h |  1 +
 10 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 4fcf517..cb44ca7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void);
  * super.c
  */
 extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
 			       int, const char *, void *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..6cb2bcb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p)
 struct vfsmount *
 vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
 {
+	int err;
 	struct mount *mnt;
 	struct dentry *root;
 
@@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
 	mnt->mnt.mnt_sb = root->d_sb;
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
+
+	err = do_mount_sb(&mnt->mnt, flags, data);
+	if(err) {
+		mnt_free_id(mnt);
+		free_vfsmnt(mnt);
+		return ERR_PTR(err);
+	}
+
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
 	unlock_mount_hash();
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7eb3cef..dd3da60 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
+	struct proc_options *opts = file->f_path.mnt->fs_data;
+
+	if (opts && opts->pid_only)
+		return 1;
 
 	return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..51b1712 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
 	if (pid->hide_pid != 0)
 		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
-
 	return 0;
 }
 
@@ -118,6 +117,8 @@ static const struct super_operations proc_sops = {
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= proc_evict_inode,
 	.statfs		= simple_statfs,
+	.getattr_fs	= proc_getattrfs,
+	.mount_fs	= proc_mount_cb,
 	.remount_fs	= proc_remount,
 	.show_options	= proc_show_options,
 };
@@ -323,6 +324,10 @@ static int proc_reg_open(struct inode *inode, struct file *file)
 	int (*open)(struct inode *, struct file *);
 	int (*release)(struct inode *, struct file *);
 	struct pde_opener *pdeo;
+	struct proc_options *opts = file->f_path.mnt->fs_data;
+
+	if (opts && opts->pid_only)
+		return -ENOENT;
 
 	/*
 	 * Ensure that
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194..7fdbfee 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -267,11 +267,19 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+struct proc_options {
+	kgid_t pid_gid;
+	int hide_pid;
+	int pid_only;
+};
+
 extern struct proc_dir_entry proc_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
 extern int proc_remount(struct super_block *, int *, char *);
+extern int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data);
+extern int proc_getattrfs(struct vfsmount *, struct dentry *, struct kstat *);
 
 /*
  * task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4bd0373..e07f37a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -14,6 +14,7 @@
 #include <linux/stat.h>
 #include <linux/init.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/user_namespace.h>
@@ -24,16 +25,17 @@
 #include "internal.h"
 
 enum {
-	Opt_gid, Opt_hidepid, Opt_err,
+	Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
 };
 
 static const match_table_t tokens = {
 	{Opt_hidepid, "hidepid=%u"},
 	{Opt_gid, "gid=%u"},
+	{Opt_pidonly, "pidonly"},
 	{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -53,7 +55,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return 0;
-			pid->pid_gid = make_kgid(current_user_ns(), option);
+			fs_opts->pid_gid = make_kgid(current_user_ns(), option);
 			break;
 		case Opt_hidepid:
 			if (match_int(&args[0], &option))
@@ -62,7 +64,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 				pr_err("proc: hidepid value must be between 0 and 2.\n");
 				return 0;
 			}
-			pid->hide_pid = option;
+			fs_opts->hide_pid = option;
+			break;
+		case Opt_pidonly:
+			fs_opts->pid_only = 1;
 			break;
 		default:
 			pr_err("proc: unrecognized mount option \"%s\" "
@@ -74,6 +79,19 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+	struct proc_options opts;
+
+	if (!proc_fill_options(options, &opts))
+		return 0;
+
+	pid->pid_gid = opts.pid_gid;
+	pid->hide_pid = opts.hide_pid;
+
+	return 1;
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
@@ -82,6 +100,42 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
 	return !proc_parse_options(data, pid);
 }
 
+int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+	struct inode *inode = d_inode(dentry);
+	struct pid *pid = proc_pid(dentry->d_inode);
+	struct proc_options *opts = mnt->fs_data;
+
+	if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
+		return -ENOENT;
+
+	if (!inode->i_op->getattr) {
+		generic_fillattr(inode, stat);
+		return 0;
+	}
+
+	return inode->i_op->getattr(mnt, dentry, stat);
+}
+
+int proc_mount_cb(struct vfsmount *mnt, int *flags, char *data)
+{
+	struct proc_options *opts;
+
+	if (!data || *flags & MS_KERNMOUNT)
+		return 0;
+
+	opts = kzalloc(sizeof(struct proc_options), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	if (!proc_fill_options(data, opts))
+		return -EINVAL;
+
+	mnt->fs_data = opts;
+
+	return 0;
+}
+
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
diff --git a/fs/stat.c b/fs/stat.c
index bc045c7..1e26308 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -10,6 +10,7 @@
 #include <linux/file.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -53,6 +54,9 @@ int vfs_getattr_nosec(struct path *path, struct kstat *stat)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
 
+	if (path->mnt->mnt_sb->s_op->getattr_fs)
+		return path->mnt->mnt_sb->s_op->getattr_fs(path->mnt, path->dentry, stat);
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path->mnt, path->dentry, stat);
 
diff --git a/fs/super.c b/fs/super.c
index c183835..478dd5b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -832,6 +832,26 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	return retval;
 }
 
+/**
+ *	do_mount_sb - asks filesystem to finish mount with options.
+ *	@mnt:	vfsmount in question
+ *	@flags:	numeric part of options
+ *	@data:	the rest of options
+ *
+ *	Alters the mount options of a mounted filesystem.
+ */
+int do_mount_sb(struct vfsmount *mnt, int flags, void *data)
+{
+	if (mnt->mnt_sb->s_writers.frozen != SB_UNFROZEN)
+		return -EBUSY;
+
+	if (mnt->mnt_sb->s_op->mount_fs) {
+		return mnt->mnt_sb->s_op->mount_fs(mnt, &flags, data);
+	}
+
+	return 0;
+}
+
 static void do_emergency_remount(struct work_struct *work)
 {
 	struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 83de8b6..5bd1b84 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1759,6 +1759,8 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
+	int (*mount_fs) (struct vfsmount *, int *, char *);
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..4bd364e 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -67,6 +67,7 @@ struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
 	int mnt_flags;
+	void *fs_data;			/* fs-specific data */
 };
 
 struct file; /* forward dec */
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-06 23:05     ` Alexey Gladkov
@ 2017-03-07 16:24       ` Andy Lutomirski
  -1 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-07 16:24 UTC (permalink / raw)
  To: Alexey Gladkov, Djalal Harouni
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> After discussion with Oleg Nesterov I reimplement my patch as an additional
> option for /proc. This option affects the mountpoint. It means that in one
> pid namespace it possible to have both the whole traditional /proc and
> /proc with only pids subset.
>

I like this.  I think you should split it into two patches, though:
one that reworks how procfs gets mounted and one that makes adds the
new functionality.

Djajal had some concerns about the first part breaking applications
that use stat and expect certain behavior.  This should be manageable,
though, but making stat work appropriately.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-07 16:24       ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-07 16:24 UTC (permalink / raw)
  To: Alexey Gladkov, Djalal Harouni
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> After discussion with Oleg Nesterov I reimplement my patch as an additional
> option for /proc. This option affects the mountpoint. It means that in one
> pid namespace it possible to have both the whole traditional /proc and
> /proc with only pids subset.
>

I like this.  I think you should split it into two patches, though:
one that reworks how procfs gets mounted and one that makes adds the
new functionality.

Djajal had some concerns about the first part breaking applications
that use stat and expect certain behavior.  This should be manageable,
though, but making stat work appropriately.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-06 23:05     ` Alexey Gladkov
  (?)
  (?)
@ 2017-03-07 17:49     ` Oleg Nesterov
  2017-03-10 23:46         ` Alexey Gladkov
  -1 siblings, 1 reply; 53+ messages in thread
From: Oleg Nesterov @ 2017-03-07 17:49 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley

I can't really review this... but in any case I think you should split
this patch to separate the vfs and proc changes.

On 03/07, Alexey Gladkov wrote:
>
> @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
>  	mnt->mnt.mnt_sb = root->d_sb;
>  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>  	mnt->mnt_parent = mnt;
> +
> +	err = do_mount_sb(&mnt->mnt, flags, data);
> +	if(err) {
> +		mnt_free_id(mnt);
> +		free_vfsmnt(mnt);
> +		return ERR_PTR(err);
> +	}

This duplicates the error handling, we do the same if mount_fs() fails.
Perhaps you should move these 2 lines into cleanup block and add goto's.

> +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct pid *pid = proc_pid(dentry->d_inode);
> +	struct proc_options *opts = mnt->fs_data;
> +
> +	if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> +		return -ENOENT;

Hmm. I don't quite understand why do we need this, and how this should work.

Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
but only because they both do stat() ?

Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
work ?

I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
out we also need to update proc_sys_dir_file_operations too and may be something
else...

> +
> +	if (!inode->i_op->getattr) {
> +		generic_fillattr(inode, stat);
> +		return 0;
> +	}
> +
> +	return inode->i_op->getattr(mnt, dentry, stat);
> +}

Oh, it would be nice to not duplicate the code from the caller, imo.

Oleg.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-07 16:24       ` Andy Lutomirski
  (?)
@ 2017-03-09 11:26       ` Djalal Harouni
  2017-03-09 20:52           ` Eric W. Biederman
  2017-03-11 21:51           ` Alexey Gladkov
  -1 siblings, 2 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-09 11:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
>
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.
>
> Djajal had some concerns about the first part breaking applications
> that use stat and expect certain behavior.  This should be manageable,
> though, but making stat work appropriately.

I'm bit lost in the two discussion, however the main concern I was
discussing with Andy was if you have per superblock proc mounts then
each mount will end up with its own device ID st_dev, right now they
share the same ID if they are in the same pid namespace, but if we
change that then we may break the following:
http://man7.org/linux/man-pages/man7/namespaces.7.html

Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
suggests to follow up with fstat() to identify the namespaces..
"By applying fstat(2) to the returned file descriptor, one obtains a
stat structure whose st_dev (resident device) and st_ino (inode
number) fields together identify the owning/parent namespace."

Other /proc/self/ns/* comparison and stat() logic...

Andy suggested that we may have the same st_dev for mounts in the same
pid namespace... I'm not sure which side effect this may bring!

Thanks!

-- 
tixxdz

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-09 20:52           ` Eric W. Biederman
  0 siblings, 0 replies; 53+ messages in thread
From: Eric W. Biederman @ 2017-03-09 20:52 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andy Lutomirski, Alexey Gladkov, Linux Kernel Mailing List,
	Linux API, Kirill A. Shutemov, Vasiliy Kulikov, Al Viro,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

Djalal Harouni <tixxdz@gmail.com> writes:

> On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>> >
>> > After discussion with Oleg Nesterov I reimplement my patch as an additional
>> > option for /proc. This option affects the mountpoint. It means that in one
>> > pid namespace it possible to have both the whole traditional /proc and
>> > /proc with only pids subset.
>> >
>>
>> I like this.  I think you should split it into two patches, though:
>> one that reworks how procfs gets mounted and one that makes adds the
>> new functionality.
>>
>> Djajal had some concerns about the first part breaking applications
>> that use stat and expect certain behavior.  This should be manageable,
>> though, but making stat work appropriately.
>
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html
>
> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Well right now it is less of an issue than you image.  I suggest you
stat /proc and /proc/self/ns/* and look at st_dev.

That said anything that changes today's proc needs to be tested with a
range of distro's especially those using selinux and apparmor as they
tend to have policies that are very sensitive to the implementation
details.  Such that seemingly innocent changes can cause systems to stop
booting.

Eric

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-09 20:52           ` Eric W. Biederman
  0 siblings, 0 replies; 53+ messages in thread
From: Eric W. Biederman @ 2017-03-09 20:52 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andy Lutomirski, Alexey Gladkov, Linux Kernel Mailing List,
	Linux API, Kirill A. Shutemov, Vasiliy Kulikov, Al Viro,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

Djalal Harouni <tixxdz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>
>> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >
>> > After discussion with Oleg Nesterov I reimplement my patch as an additional
>> > option for /proc. This option affects the mountpoint. It means that in one
>> > pid namespace it possible to have both the whole traditional /proc and
>> > /proc with only pids subset.
>> >
>>
>> I like this.  I think you should split it into two patches, though:
>> one that reworks how procfs gets mounted and one that makes adds the
>> new functionality.
>>
>> Djajal had some concerns about the first part breaking applications
>> that use stat and expect certain behavior.  This should be manageable,
>> though, but making stat work appropriately.
>
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html
>
> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Well right now it is less of an issue than you image.  I suggest you
stat /proc and /proc/self/ns/* and look at st_dev.

That said anything that changes today's proc needs to be tested with a
range of distro's especially those using selinux and apparmor as they
tend to have policies that are very sensitive to the implementation
details.  Such that seemingly innocent changes can cause systems to stop
booting.

Eric

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-10 23:46         ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-10 23:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote:
> I can't really review this... but in any case I think you should split
> this patch to separate the vfs and proc changes.
> 
> On 03/07, Alexey Gladkov wrote:
> >
> > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
> >  	mnt->mnt.mnt_sb = root->d_sb;
> >  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> >  	mnt->mnt_parent = mnt;
> > +
> > +	err = do_mount_sb(&mnt->mnt, flags, data);
> > +	if(err) {
> > +		mnt_free_id(mnt);
> > +		free_vfsmnt(mnt);
> > +		return ERR_PTR(err);
> > +	}
> 
> This duplicates the error handling, we do the same if mount_fs() fails.
> Perhaps you should move these 2 lines into cleanup block and add goto's.
> 
> > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> > +{
> > +	struct inode *inode = d_inode(dentry);
> > +	struct pid *pid = proc_pid(dentry->d_inode);
> > +	struct proc_options *opts = mnt->fs_data;
> > +
> > +	if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> > +		return -ENOENT;
> 
> Hmm. I don't quite understand why do we need this, and how this should work.
> 
> Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
> but only because they both do stat() ?
> 
> Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
> work ?

Yes, you're right! I thought that getattr is called always together with
open(). I wanted to prevent all attempts open() for not-pid directories.

> I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
> out we also need to update proc_sys_dir_file_operations too and may be something
> else...

My main task was to hide all possible direcitrices from the /proc
(in pidonly mode)... even those which we do not know. In this case we
can't rely on the fact that everyone will follow the rules and to
properly handle open().

My current attempt was to force filesystem level check of mountpoint flag.
This is necessary to avoid even the theoretical possibility of ignoring
"pidonly" parameter.

I guess I need to add callback to vfs_open or something to can be sure
that we will not open the wrong file or directory in pidonly mode.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-10 23:46         ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-10 23:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote:
> I can't really review this... but in any case I think you should split
> this patch to separate the vfs and proc changes.
> 
> On 03/07, Alexey Gladkov wrote:
> >
> > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
> >  	mnt->mnt.mnt_sb = root->d_sb;
> >  	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> >  	mnt->mnt_parent = mnt;
> > +
> > +	err = do_mount_sb(&mnt->mnt, flags, data);
> > +	if(err) {
> > +		mnt_free_id(mnt);
> > +		free_vfsmnt(mnt);
> > +		return ERR_PTR(err);
> > +	}
> 
> This duplicates the error handling, we do the same if mount_fs() fails.
> Perhaps you should move these 2 lines into cleanup block and add goto's.
> 
> > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> > +{
> > +	struct inode *inode = d_inode(dentry);
> > +	struct pid *pid = proc_pid(dentry->d_inode);
> > +	struct proc_options *opts = mnt->fs_data;
> > +
> > +	if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> > +		return -ENOENT;
> 
> Hmm. I don't quite understand why do we need this, and how this should work.
> 
> Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
> but only because they both do stat() ?
> 
> Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
> work ?

Yes, you're right! I thought that getattr is called always together with
open(). I wanted to prevent all attempts open() for not-pid directories.

> I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
> out we also need to update proc_sys_dir_file_operations too and may be something
> else...

My main task was to hide all possible direcitrices from the /proc
(in pidonly mode)... even those which we do not know. In this case we
can't rely on the fact that everyone will follow the rules and to
properly handle open().

My current attempt was to force filesystem level check of mountpoint flag.
This is necessary to avoid even the theoretical possibility of ignoring
"pidonly" parameter.

I guess I need to add callback to vfs_open or something to can be sure
that we will not open the wrong file or directory in pidonly mode.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-11  0:05         ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-11  0:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Djalal Harouni, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley, Dmitry V. Levin

On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
> 
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.

Sure, but first I wanted to discuss the idea. My patch isn't very useful
without the ability to add additional files.

I will split it into parts in the new version.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-11  0:05         ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-11  0:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Djalal Harouni, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley, Dmitry V. Levin

On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov <gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
> 
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.

Sure, but first I wanted to discuss the idea. My patch isn't very useful
without the ability to add additional files.

I will split it into parts in the new version.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-11 21:51           ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-11 21:51 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andy Lutomirski, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley, Dmitry V. Levin

On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote:
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html

In fact, nothing has changed. I added a parameter that affects
the mountpoint, not the entire pid namespace.

The procfs will still be global. The device ID will be the same as before.

> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
> 
> Other /proc/self/ns/* comparison and stat() logic...
> 
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Basically we have here a issue because other proc options (hidepid for
example) affect the entire pid namespace, but, I guess, have to affect
the mountpoint.

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime 0 0

# mount -t proc proc /tmp/proc
# mount -o remount,hidepid=2 -t proc proc /tmp/proc

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-11 21:51           ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-11 21:51 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andy Lutomirski, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Al Viro, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley, Dmitry V. Levin

On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote:
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html

In fact, nothing has changed. I added a parameter that affects
the mountpoint, not the entire pid namespace.

The procfs will still be global. The device ID will be the same as before.

> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
> 
> Other /proc/self/ns/* comparison and stat() logic...
> 
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Basically we have here a issue because other proc options (hidepid for
example) affect the entire pid namespace, but, I guess, have to affect
the mountpoint.

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime 0 0

# mount -t proc proc /tmp/proc
# mount -o remount,hidepid=2 -t proc proc /tmp/proc

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-06 23:05     ` Alexey Gladkov
@ 2017-03-12  1:54       ` Al Viro
  -1 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-12  1:54 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 83de8b6..5bd1b84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,6 +1759,8 @@ struct super_operations {
>  	int (*thaw_super) (struct super_block *);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
> +	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> +	int (*mount_fs) (struct vfsmount *, int *, char *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1172cce..4bd364e 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -67,6 +67,7 @@ struct vfsmount {
>  	struct dentry *mnt_root;	/* root of the mounted tree */
>  	struct super_block *mnt_sb;	/* pointer to superblock */
>  	int mnt_flags;
> +	void *fs_data;			/* fs-specific data */

	No.  This is really asking for trouble - you *can't* hang
fs-specific data structures off vfsmount.  Lifetimes make no
sense whatsoever.

	BTW, your patch leaks supreblock reference on failure, and
is kludgy as hell wrt to what it's doing in procfs itself, but
that's secondary - the quoted part is enough for a hard NAK on
architectural grounds.  Don't go there.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-12  1:54       ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-12  1:54 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 83de8b6..5bd1b84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,6 +1759,8 @@ struct super_operations {
>  	int (*thaw_super) (struct super_block *);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
> +	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> +	int (*mount_fs) (struct vfsmount *, int *, char *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1172cce..4bd364e 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -67,6 +67,7 @@ struct vfsmount {
>  	struct dentry *mnt_root;	/* root of the mounted tree */
>  	struct super_block *mnt_sb;	/* pointer to superblock */
>  	int mnt_flags;
> +	void *fs_data;			/* fs-specific data */

	No.  This is really asking for trouble - you *can't* hang
fs-specific data structures off vfsmount.  Lifetimes make no
sense whatsoever.

	BTW, your patch leaks supreblock reference on failure, and
is kludgy as hell wrt to what it's doing in procfs itself, but
that's secondary - the quoted part is enough for a hard NAK on
architectural grounds.  Don't go there.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-12  2:13         ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-12  2:13 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Sun, Mar 12, 2017 at 01:54:38AM +0000, Al Viro wrote:
> On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 83de8b6..5bd1b84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,6 +1759,8 @@ struct super_operations {
> >  	int (*thaw_super) (struct super_block *);
> >  	int (*unfreeze_fs) (struct super_block *);
> >  	int (*statfs) (struct dentry *, struct kstatfs *);
> > +	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> > +	int (*mount_fs) (struct vfsmount *, int *, char *);
> >  	int (*remount_fs) (struct super_block *, int *, char *);
> >  	void (*umount_begin) (struct super_block *);
> >  
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 1172cce..4bd364e 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -67,6 +67,7 @@ struct vfsmount {
> >  	struct dentry *mnt_root;	/* root of the mounted tree */
> >  	struct super_block *mnt_sb;	/* pointer to superblock */
> >  	int mnt_flags;
> > +	void *fs_data;			/* fs-specific data */
> 
> 	No.  This is really asking for trouble - you *can't* hang
> fs-specific data structures off vfsmount.  Lifetimes make no
> sense whatsoever.
>
> 	BTW, your patch leaks supreblock reference on failure, and
> is kludgy as hell wrt to what it's doing in procfs itself, but
> that's secondary - the quoted part is enough for a hard NAK on
> architectural grounds.  Don't go there.

PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
expose the full thing.  And as for the lifetimes making no sense...
note that you are simply not freeing these structures of yours.
Try to handle that and you'll get a serious PITA all over the
place.

What are you trying to achieve, anyway?  Why not add a second vfsmount
pointer per pid_namespace and make it initialized on demand, at the
first attempt of no-pid mount?  Just have a separate no-pid instance
created for those namespaces where it had been asked for, with
separate superblock and dentry tree not containing anything other
that pid-only parts + self + thread-self...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-12  2:13         ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-12  2:13 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley

On Sun, Mar 12, 2017 at 01:54:38AM +0000, Al Viro wrote:
> On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 83de8b6..5bd1b84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,6 +1759,8 @@ struct super_operations {
> >  	int (*thaw_super) (struct super_block *);
> >  	int (*unfreeze_fs) (struct super_block *);
> >  	int (*statfs) (struct dentry *, struct kstatfs *);
> > +	int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> > +	int (*mount_fs) (struct vfsmount *, int *, char *);
> >  	int (*remount_fs) (struct super_block *, int *, char *);
> >  	void (*umount_begin) (struct super_block *);
> >  
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 1172cce..4bd364e 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -67,6 +67,7 @@ struct vfsmount {
> >  	struct dentry *mnt_root;	/* root of the mounted tree */
> >  	struct super_block *mnt_sb;	/* pointer to superblock */
> >  	int mnt_flags;
> > +	void *fs_data;			/* fs-specific data */
> 
> 	No.  This is really asking for trouble - you *can't* hang
> fs-specific data structures off vfsmount.  Lifetimes make no
> sense whatsoever.
>
> 	BTW, your patch leaks supreblock reference on failure, and
> is kludgy as hell wrt to what it's doing in procfs itself, but
> that's secondary - the quoted part is enough for a hard NAK on
> architectural grounds.  Don't go there.

PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
expose the full thing.  And as for the lifetimes making no sense...
note that you are simply not freeing these structures of yours.
Try to handle that and you'll get a serious PITA all over the
place.

What are you trying to achieve, anyway?  Why not add a second vfsmount
pointer per pid_namespace and make it initialized on demand, at the
first attempt of no-pid mount?  Just have a separate no-pid instance
created for those namespaces where it had been asked for, with
separate superblock and dentry tree not containing anything other
that pid-only parts + self + thread-self...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13  3:19           ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-13  3:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> expose the full thing.  And as for the lifetimes making no sense...
> note that you are simply not freeing these structures of yours.
> Try to handle that and you'll get a serious PITA all over the
> place.
>
> What are you trying to achieve, anyway?  Why not add a second vfsmount
> pointer per pid_namespace and make it initialized on demand, at the
> first attempt of no-pid mount?  Just have a separate no-pid instance
> created for those namespaces where it had been asked for, with
> separate superblock and dentry tree not containing anything other
> that pid-only parts + self + thread-self...

Can't we just make procfs work like most other filesystems and have
each mount have its own superblock?  If we need to do something funky
to stat() output to keep existing userspace working, I think that's
okay.

As far as I can tell, proc_mnt is very nearly useless -- it seems to
be used for proc_flush_task (which claims to be purely an optimization
and could be preserved in the common case where there's only one
relevant mount) and for sysctl_binary.  For the latter, we could
create proc_mnt but make actual user-initiated mounts be new
superblocks anyway.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13  3:19           ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-13  3:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> expose the full thing.  And as for the lifetimes making no sense...
> note that you are simply not freeing these structures of yours.
> Try to handle that and you'll get a serious PITA all over the
> place.
>
> What are you trying to achieve, anyway?  Why not add a second vfsmount
> pointer per pid_namespace and make it initialized on demand, at the
> first attempt of no-pid mount?  Just have a separate no-pid instance
> created for those namespaces where it had been asked for, with
> separate superblock and dentry tree not containing anything other
> that pid-only parts + self + thread-self...

Can't we just make procfs work like most other filesystems and have
each mount have its own superblock?  If we need to do something funky
to stat() output to keep existing userspace working, I think that's
okay.

As far as I can tell, proc_mnt is very nearly useless -- it seems to
be used for proc_flush_task (which claims to be purely an optimization
and could be preserved in the common case where there's only one
relevant mount) and for sysctl_binary.  For the latter, we could
create proc_mnt but make actual user-initiated mounts be new
superblocks anyway.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13 13:27             ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-13 13:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> > expose the full thing.  And as for the lifetimes making no sense...
> > note that you are simply not freeing these structures of yours.
> > Try to handle that and you'll get a serious PITA all over the
> > place.
> >
> > What are you trying to achieve, anyway?  Why not add a second vfsmount
> > pointer per pid_namespace and make it initialized on demand, at the
> > first attempt of no-pid mount?  Just have a separate no-pid instance
> > created for those namespaces where it had been asked for, with
> > separate superblock and dentry tree not containing anything other
> > that pid-only parts + self + thread-self...
> 
> Can't we just make procfs work like most other filesystems and have
> each mount have its own superblock?  If we need to do something funky
> to stat() output to keep existing userspace working, I think that's
> okay.

First of all, most of the filesystems do *NOT* guarantee anything of
that sort.  And what's the point of having more instances than
necessary, anyway?

> As far as I can tell, proc_mnt is very nearly useless -- it seems to
> be used for proc_flush_task (which claims to be purely an optimization
> and could be preserved in the common case where there's only one
> relevant mount) and for sysctl_binary.  For the latter, we could
> create proc_mnt but make actual user-initiated mounts be new
> superblocks anyway.

Again, what for?  It won't salvage that kludge...  It's not as if it
had been hard to have separate pid-only instance created when asked
for (and reused every time when we are asked for pid-only).  What's
the point of ever having more than two instances per pidns?  IDGI...

Folks, there is no one-to-one correspondence between mountpoints and
superblocks.  Not since 2000 or so.  Just don't try to shove your
per-superblock stuff into vfsmount; it simply won't work.  If you
want a separate instance for that thing, then just go ahead and
have ->mount() decide which one to use (and whether to create a new
one).  All there is to it...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13 13:27             ` Al Viro
  0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2017-03-13 13:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> > expose the full thing.  And as for the lifetimes making no sense...
> > note that you are simply not freeing these structures of yours.
> > Try to handle that and you'll get a serious PITA all over the
> > place.
> >
> > What are you trying to achieve, anyway?  Why not add a second vfsmount
> > pointer per pid_namespace and make it initialized on demand, at the
> > first attempt of no-pid mount?  Just have a separate no-pid instance
> > created for those namespaces where it had been asked for, with
> > separate superblock and dentry tree not containing anything other
> > that pid-only parts + self + thread-self...
> 
> Can't we just make procfs work like most other filesystems and have
> each mount have its own superblock?  If we need to do something funky
> to stat() output to keep existing userspace working, I think that's
> okay.

First of all, most of the filesystems do *NOT* guarantee anything of
that sort.  And what's the point of having more instances than
necessary, anyway?

> As far as I can tell, proc_mnt is very nearly useless -- it seems to
> be used for proc_flush_task (which claims to be purely an optimization
> and could be preserved in the common case where there's only one
> relevant mount) and for sysctl_binary.  For the latter, we could
> create proc_mnt but make actual user-initiated mounts be new
> superblocks anyway.

Again, what for?  It won't salvage that kludge...  It's not as if it
had been hard to have separate pid-only instance created when asked
for (and reused every time when we are asked for pid-only).  What's
the point of ever having more than two instances per pidns?  IDGI...

Folks, there is no one-to-one correspondence between mountpoints and
superblocks.  Not since 2000 or so.  Just don't try to shove your
per-superblock stuff into vfsmount; it simply won't work.  If you
want a separate instance for that thing, then just go ahead and
have ->mount() decide which one to use (and whether to create a new
one).  All there is to it...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13 15:24               ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-13 15:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Mon, Mar 13, 2017 at 6:27 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
>> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
>> > expose the full thing.  And as for the lifetimes making no sense...
>> > note that you are simply not freeing these structures of yours.
>> > Try to handle that and you'll get a serious PITA all over the
>> > place.
>> >
>> > What are you trying to achieve, anyway?  Why not add a second vfsmount
>> > pointer per pid_namespace and make it initialized on demand, at the
>> > first attempt of no-pid mount?  Just have a separate no-pid instance
>> > created for those namespaces where it had been asked for, with
>> > separate superblock and dentry tree not containing anything other
>> > that pid-only parts + self + thread-self...
>>
>> Can't we just make procfs work like most other filesystems and have
>> each mount have its own superblock?  If we need to do something funky
>> to stat() output to keep existing userspace working, I think that's
>> okay.
>
> First of all, most of the filesystems do *NOT* guarantee anything of
> that sort.  And what's the point of having more instances than
> necessary, anyway?

I mean that, if I do:

mount -t proc -o foobar none a
mount -t proc -o baz none b

Then I think that the second mount should create a whole new proc
instance rather than just a new vfsmount.  Then the options could
differ, which would solve a bunch of problems.

>
> Again, what for?  It won't salvage that kludge...  It's not as if it
> had been hard to have separate pid-only instance created when asked
> for (and reused every time when we are asked for pid-only).  What's
> the point of ever having more than two instances per pidns?  IDGI...

I can easily procfs growing more than one interesting option.
Pid-only and hidepid come to mind, and that's already six possible
combinations.  The current hidepid implementation is really awful.

>
> Folks, there is no one-to-one correspondence between mountpoints and
> superblocks.  Not since 2000 or so.  Just don't try to shove your
> per-superblock stuff into vfsmount; it simply won't work.  If you
> want a separate instance for that thing, then just go ahead and
> have ->mount() decide which one to use (and whether to create a new
> one).  All there is to it...

That's what I mean.  I just don't see the point of going all-out in
trying to reuse superblocks.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-13 15:24               ` Andy Lutomirski
  0 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2017-03-13 15:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Alexey Gladkov, Linux Kernel Mailing List, Linux API,
	Kirill A. Shutemov, Vasiliy Kulikov, Eric W. Biederman,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley

On Mon, Mar 13, 2017 at 6:27 AM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
>> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
>> > expose the full thing.  And as for the lifetimes making no sense...
>> > note that you are simply not freeing these structures of yours.
>> > Try to handle that and you'll get a serious PITA all over the
>> > place.
>> >
>> > What are you trying to achieve, anyway?  Why not add a second vfsmount
>> > pointer per pid_namespace and make it initialized on demand, at the
>> > first attempt of no-pid mount?  Just have a separate no-pid instance
>> > created for those namespaces where it had been asked for, with
>> > separate superblock and dentry tree not containing anything other
>> > that pid-only parts + self + thread-self...
>>
>> Can't we just make procfs work like most other filesystems and have
>> each mount have its own superblock?  If we need to do something funky
>> to stat() output to keep existing userspace working, I think that's
>> okay.
>
> First of all, most of the filesystems do *NOT* guarantee anything of
> that sort.  And what's the point of having more instances than
> necessary, anyway?

I mean that, if I do:

mount -t proc -o foobar none a
mount -t proc -o baz none b

Then I think that the second mount should create a whole new proc
instance rather than just a new vfsmount.  Then the options could
differ, which would solve a bunch of problems.

>
> Again, what for?  It won't salvage that kludge...  It's not as if it
> had been hard to have separate pid-only instance created when asked
> for (and reused every time when we are asked for pid-only).  What's
> the point of ever having more than two instances per pidns?  IDGI...

I can easily procfs growing more than one interesting option.
Pid-only and hidepid come to mind, and that's already six possible
combinations.  The current hidepid implementation is really awful.

>
> Folks, there is no one-to-one correspondence between mountpoints and
> superblocks.  Not since 2000 or so.  Just don't try to shove your
> per-superblock stuff into vfsmount; it simply won't work.  If you
> want a separate instance for that thing, then just go ahead and
> have ->mount() decide which one to use (and whether to create a new
> one).  All there is to it...

That's what I mean.  I just don't see the point of going all-out in
trying to reuse superblocks.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC] Add option to mount only a pids subset
  2017-03-12  2:13         ` Al Viro
  (?)
  (?)
@ 2017-03-20 12:58         ` Alexey Gladkov
  2017-03-23 16:05             ` Oleg Nesterov
  2017-03-23 16:06             ` Djalal Harouni
  -1 siblings, 2 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-20 12:58 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linux API, Kirill A. Shutemov, Vasiliy Kulikov, Al Viro,
	Eric W. Biederman, Oleg Nesterov, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin


Al Viro, this patch looks better ?

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids without
rest objects. Since this is an option for /proc, flags applied to /proc
have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid
sunset and additional required files will be mounted on top using the
overlayfs.

But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

* Add overlayfs support.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/generic.c             |   5 ++
 fs/proc/inode.c               |   2 +
 fs/proc/internal.h            |   7 +++
 fs/proc/root.c                | 113 ++++++++++++++++++++++++++++++++++++++++--
 include/linux/pid_namespace.h |   1 +
 5 files changed, 123 insertions(+), 5 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..50bb1e9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -23,6 +23,7 @@
 #include <linux/spinlock.h>
 #include <linux/completion.h>
 #include <linux/uaccess.h>
+#include <linux/pid_namespace.h>
 
 #include "internal.h"
 
@@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
+	struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+	if (ns->pidfs && inode == d_inode(ns->pidfs))
+		return 1;
 
 	return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..0c9be65 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
 	if (pid->hide_pid != HIDEPID_OFF)
 		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+	if (root == pid->pidfs)
+		seq_printf(seq, ",pidonly");
 
 	return 0;
 }
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b..a5a4bf1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+struct proc_options {
+	kgid_t pid_gid;
+	int hide_pid;
+	int pid_only;
+};
+
 extern struct proc_dir_entry proc_root;
+extern struct proc_dir_entry pidfs_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..c2443d5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -26,16 +26,17 @@
 #include "internal.h"
 
 enum {
-	Opt_gid, Opt_hidepid, Opt_err,
+	Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
 };
 
 static const match_table_t tokens = {
 	{Opt_hidepid, "hidepid=%u"},
 	{Opt_gid, "gid=%u"},
+	{Opt_pidonly, "pidonly"},
 	{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
 {
 	char *p;
 	substring_t args[MAX_OPT_ARGS];
@@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return 0;
-			pid->pid_gid = make_kgid(current_user_ns(), option);
+			fs_opts->pid_gid = make_kgid(current_user_ns(), option);
 			break;
 		case Opt_hidepid:
 			if (match_int(&args[0], &option))
@@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 				pr_err("proc: hidepid value must be between 0 and 2.\n");
 				return 0;
 			}
-			pid->hide_pid = option;
+			fs_opts->hide_pid = option;
+			break;
+		case Opt_pidonly:
+			fs_opts->pid_only = 1;
 			break;
 		default:
 			pr_err("proc: unrecognized mount option \"%s\" "
@@ -77,6 +81,72 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+	struct proc_options opts = { 0 };
+
+	if (!proc_fill_options(options, &opts))
+		return 0;
+
+	pid->pid_gid = opts.pid_gid;
+	pid->hide_pid = opts.hide_pid;
+
+	return 1;
+}
+
+static int pidfs_register_dir(struct dentry *root, char *name, struct inode *inode)
+{
+	struct inode *root_inode = d_inode(root);
+	struct dentry *child;
+
+	inode_lock(root_inode);
+	child = d_alloc_name(root, name);
+	if (child) {
+		d_add(child, inode);
+	} else {
+		child = ERR_PTR(-ENOMEM);
+	}
+	inode_unlock(root_inode);
+	if (IS_ERR(child)) {
+		pr_err("pidfs_register_dir: can't allocate /pidfs/%s\n", name);
+		return PTR_ERR(child);
+	}
+	return 0;
+}
+
+static int fill_pidfs_root(struct super_block *s)
+{
+	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct inode *root_inode;
+	struct dentry *root;
+	int ret;
+
+	pde_get(&pidfs_root);
+	root_inode = proc_get_inode(s, &pidfs_root);
+	if (!root_inode) {
+		pr_err("pidfs_fill_root: get root inode failed\n");
+		return -ENOMEM;
+	}
+
+	root = d_make_root(root_inode);
+	if (!root) {
+		pr_err("pidfs_fill_root: allocate dentry failed\n");
+		return -ENOMEM;
+	}
+
+	ret = pidfs_register_dir(root, "self", d_inode(ns->proc_self));
+	if (ret)
+		return ret;
+
+	ret = pidfs_register_dir(root, "thread-self", d_inode(ns->proc_thread_self));
+	if (ret)
+		return ret;
+
+	ns->pidfs = root;
+
+	return 0;
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
 	struct pid_namespace *pid = sb->s_fs_info;
@@ -89,6 +159,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
 	struct pid_namespace *ns;
+	static struct dentry *root;
+	struct proc_options opts = { 0 };
 
 	if (flags & MS_KERNMOUNT) {
 		ns = data;
@@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 		ns = task_active_pid_ns(current);
 	}
 
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+	root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+
+	if (!IS_ERR(root)) {
+		if (!proc_fill_options(data, &opts))
+			return ERR_PTR(-EINVAL);
+
+		if (opts.pid_only) {
+			int ret;
+
+			if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
+				return ERR_PTR(ret);
+
+			root = ns->pidfs;
+		}
+	}
+
+	return root;
 }
 
 static void proc_kill_sb(struct super_block *sb)
@@ -109,6 +197,8 @@ static void proc_kill_sb(struct super_block *sb)
 		dput(ns->proc_self);
 	if (ns->proc_thread_self)
 		dput(ns->proc_thread_self);
+	if (ns->pidfs)
+		dput(ns->pidfs);
 	kill_anon_super(sb);
 	put_pid_ns(ns);
 }
@@ -214,6 +304,19 @@ struct proc_dir_entry proc_root = {
 	.name		= "/proc",
 };
 
+struct proc_dir_entry pidfs_root = {
+	.low_ino	= PROC_ROOT_INO,
+	.namelen	= 5,
+	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
+	.nlink		= 2,
+	.count		= ATOMIC_INIT(1),
+	.proc_iops	= &proc_root_inode_operations,
+	.proc_fops	= &proc_root_operations,
+	.parent		= &pidfs_root,
+	.subdir		= RB_ROOT,
+	.name		= "/pidfs",
+};
+
 int pid_ns_prepare_proc(struct pid_namespace *ns)
 {
 	struct vfsmount *mnt;
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c2a989d..e7b4d64 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -41,6 +41,7 @@ struct pid_namespace {
 	struct vfsmount *proc_mnt;
 	struct dentry *proc_self;
 	struct dentry *proc_thread_self;
+	struct dentry *pidfs;
 #endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH] proc: allow to change proc mount options per mount
@ 2017-03-23 15:59                 ` Djalal Harouni
  0 siblings, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-23 15:59 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Andy Lutomirski, Alexey Gladkov
  Cc: Al Viro, Linux API, kirill, Vasiliy Kulikov, ebiederm,
	Oleg Nesterov, Pavel Emelyanov, James Bottomley, Kees Cook,
	Dongsu Park, Djalal Harouni

As per the discussion with Andy, and following what Al Viro suggested
maybe this can work ? the patch is still buggy on top of Linus' tree 093b995e3b

Currently hidepid mount option is propagated to all proc mounts that are
in the same pid namespace. This patch make it possible to have proc
mounts with different options inside the same pid namespace.

Since this may break userspace or code that checks/expects some device
identifiers, this mode is only supported when theo ption "version=2" is
passed.

[Buggy patch]

[tixxdz@fedora-kvm linux]$ sudo mount -t proc none /test
[tixxdz@fedora-kvm linux]$ sudo mount -t proc -o hidepid=2,version=2 none /test2
[tixxdz@fedora-kvm linux]$ ls -l /proc | head -6
total 0
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 1
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 10
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 100
dr-xr-xr-x.  9 gdm     gdm                   0 Mar 23 17:25 1005
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test | head -16
total 0
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 1
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 10
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 100
dr-xr-xr-x.  9 gdm     gdm                   0 Mar 23 17:25 1005
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test2 | head -6
total 0
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1182
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1197
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1199
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1222
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1225


Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
---
 fs/locks.c              |   6 +-
 fs/proc/base.c          |  51 +++++++++-----
 fs/proc/generic.c       |   5 ++
 fs/proc/inode.c         |  10 +--
 fs/proc/root.c          | 181 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/proc/self.c          |   8 ++-
 fs/proc/thread_self.c   |   6 +-
 fs/proc_namespace.c     |  14 ++--
 include/linux/proc_fs.h |   9 +++
 9 files changed, 248 insertions(+), 42 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2681132..dab5058 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2617,7 +2617,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 	unsigned int fl_pid;
 
 	if (fl->fl_nspid) {
-		struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+		struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+		struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 		/* Don't let fl_pid change based on who is reading the file */
 		fl_pid = pid_nr_ns(fl->fl_nspid, proc_pidns);
@@ -2701,7 +2702,8 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+	struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c87b6b9..74b389d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -681,13 +681,24 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (pid->hide_pid < hide_pid_min)
+	int hide_pid;
+	kgid_t pid_gid;
+	int version = fs_info->version;
+
+	if (version == 2) {
+		hide_pid = fs_info->hide_pid;
+		pid_gid = fs_info->pid_gid;
+	} else {
+		hide_pid = fs_info->pid_ns->hide_pid;
+		pid_gid = fs_info->pid_ns->pid_gid;
+	}
+	if (hide_pid < hide_pid_min)
 		return true;
-	if (in_group_p(pid->pid_gid))
+	if (in_group_p(pid_gid))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -695,14 +706,15 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-	struct pid_namespace *pid = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 	struct task_struct *task;
 	bool has_perms;
 
 	task = get_proc_task(inode);
 	if (!task)
 		return -ESRCH;
-	has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
 	put_task_struct(task);
 
 	if (!has_perms) {
@@ -730,12 +742,12 @@ static const struct inode_operations proc_def_inode_operations = {
 static int proc_single_show(struct seq_file *m, void *v)
 {
 	struct inode *inode = m->private;
-	struct pid_namespace *ns;
 	struct pid *pid;
 	struct task_struct *task;
 	int ret;
 
-	ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid = proc_pid(inode);
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task)
@@ -1734,7 +1746,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 {
 	struct inode *inode = d_inode(path->dentry);
 	struct task_struct *task;
-	struct pid_namespace *pid = path->dentry->d_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
 
 	generic_fillattr(inode, stat);
 
@@ -1743,7 +1755,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 	stat->gid = GLOBAL_ROOT_GID;
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
-		if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
 			rcu_read_unlock();
 			/*
 			 * This doesn't prevent learning whether PID exists,
@@ -2249,6 +2261,8 @@ static const struct seq_operations proc_timers_seq_ops = {
 static int proc_timers_open(struct inode *inode, struct file *file)
 {
 	struct timers_private *tp;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	tp = __seq_open_private(file, &proc_timers_seq_ops,
 			sizeof(struct timers_private));
@@ -2256,7 +2270,7 @@ static int proc_timers_open(struct inode *inode, struct file *file)
 		return -ENOMEM;
 
 	tp->pid = proc_pid(inode);
-	tp->ns = inode->i_sb->s_fs_info;
+	tp->ns = ns;
 	return 0;
 }
 
@@ -3077,13 +3091,13 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
 	int result = -ENOENT;
 	struct task_struct *task;
 	unsigned tgid;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(dir->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	tgid = name_to_int(&dentry->d_name);
 	if (tgid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
@@ -3147,7 +3161,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
-	struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(file_inode(file)->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	loff_t pos = ctx->pos;
 
 	if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
@@ -3174,7 +3189,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		int len;
 
 		cond_resched();
-		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
 		len = snprintf(name, sizeof(name), "%d", iter.tgid);
@@ -3371,7 +3386,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
 	unsigned tid;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(dentry->d_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	if (!leader)
 		goto out_no_task;
@@ -3380,7 +3396,6 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (tid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tid, ns);
 	if (task)
@@ -3482,7 +3497,8 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
 	struct task_struct *task;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	int tid;
 
 	if (proc_inode_is_dead(inode))
@@ -3494,7 +3510,6 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	/* f_version caches the tgid value that the last readdir call couldn't
 	 * return. lseek aka telldir automagically resets f_version to 0.
 	 */
-	ns = inode->i_sb->s_fs_info;
 	tid = (int)file->f_version;
 	file->f_version = 0;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..49c8cb9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -28,6 +28,11 @@
 
 static DEFINE_RWLOCK(proc_subdir_lock);
 
+struct proc_fs_info *proc_sb(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
 {
 	if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..4743943 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,7 +103,8 @@ void __init proc_init_inodecache(void)
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct super_block *sb = root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 
 	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
@@ -473,11 +474,12 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
 	struct inode *root_inode;
 	int ret;
 
-	if (!proc_parse_options(data, ns))
+	if (fs_info->version == 1 && !proc_parse_options(data, ns))
 		return -EINVAL;
 
 	/* User space would break if executables or devices appear on proc */
@@ -495,7 +497,7 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
 	if (!root_inode) {
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..d4047ef 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/sched/stat.h>
+#include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/user_namespace.h>
@@ -26,12 +27,13 @@
 #include "internal.h"
 
 enum {
-	Opt_gid, Opt_hidepid, Opt_err,
+	Opt_gid, Opt_hidepid, Opt_version ,Opt_err,
 };
 
 static const match_table_t tokens = {
 	{Opt_hidepid, "hidepid=%u"},
 	{Opt_gid, "gid=%u"},
+	{Opt_version, "version=%u"},
 	{Opt_err, NULL},
 };
 
@@ -67,6 +69,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 			}
 			pid->hide_pid = option;
 			break;
+		case Opt_version:
+			break;
 		default:
 			pr_err("proc: unrecognized mount option \"%s\" "
 			       "or missing value\n", p);
@@ -77,40 +81,205 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+struct proc_options {
+	int version;	/* version field auto set to 1 to not break userspace */
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
+int proc_parse_early_options(char *options, void *holder,
+			     struct proc_options *fs_options)
+{
+	char *p, *opts, *orig;
+	substring_t args[MAX_OPT_ARGS];
+	int option, ret = 0;
+
+	if (!options)
+		return 0;
+
+	opts = kstrdup(options, GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	orig = opts;
+
+	while ((p = strsep(&opts, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_version:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (option < 1 || option > 2) {
+				pr_err("proc: version value must be 1 or 2.\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->version = option;
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->pid_gid = make_kgid(current_user_ns(), option);
+			break;
+		case Opt_hidepid:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (option < 0 || option > 2) {
+				pr_err("proc: hidepid value must be between 0 and 2.\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->hide_pid = option;
+			break;
+		case Opt_err:
+			/*
+			 * pr_err("proc: unrecognized mount option \"%s\" ", p);
+			 * ret = -EINVAL;
+			 * goto out;
+			 */
+		default:
+			break;
+		}
+	}
+
+out:
+	kfree(orig);
+	return ret;
+}
+
+static int proc_test_super(struct super_block *s, void *data)
+{
+	int ret = 0;
+	struct proc_fs_info *p = data;
+	struct proc_fs_info *fs_info = proc_sb(s);
+
+	if (p->version == 1 && p->pid_ns == fs_info->pid_ns)
+		ret = 1;
+	/*
+	if (p->version == 2 && p->pid_ns == fs_info->pid_ns &&
+	    p->hide_pid == fs_info->hide_pid &&
+	    gid_eq(p->pid_gid, fs_info->pid_gid))
+		ret = 1;
+	*/
+	return ret;
+}
+
+static int proc_set_super(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
-	struct pid_namespace *pid = sb->s_fs_info;
+	int error;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
+	struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };
 
 	sync_filesystem(sb);
-	return !proc_parse_options(data, pid);
+	error = proc_parse_early_options(data, sb->s_type, &fs_options);
+	if (error < 0)
+		return error;
+
+	if (fs_options.version == 1) {
+		error = proc_parse_options(data, pid);
+		if (!error)
+			return -EINVAL;
+	}
+
+	fs_info->version = fs_options.version;
+	fs_info->pid_gid = fs_options.pid_gid;
+	fs_info->hide_pid = fs_options.hide_pid;
+
+	return 0;
 }
 
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
 	struct pid_namespace *ns;
+	struct super_block *sb;
+	struct proc_fs_info *fs_info = NULL;
+	struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };
+	int error = 0;
+
+	if (!(flags & MS_KERNMOUNT)) {
+		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+
+
+		error = proc_parse_early_options(data, fs_type, &fs_options);
+		if (error)
+			return ERR_PTR(error);
+	}
+
+	fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_NOFS);
+	if (!fs_info)
+		return ERR_PTR(-ENOMEM);
+
+	fs_info->version = fs_options.version;
+	fs_info->pid_gid = fs_options.pid_gid;
+	fs_info->hide_pid = fs_options.hide_pid;
 
 	if (flags & MS_KERNMOUNT) {
 		ns = data;
 		data = NULL;
+		fs_info->version = 1; /* Lets restore this */
 	} else {
 		ns = task_active_pid_ns(current);
 	}
 
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+	fs_info->pid_ns = ns;
+
+	sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+			 ns->user_ns, fs_info);
+	if (IS_ERR(sb)) {
+		error = PTR_ERR(sb);
+		goto error_fs_info;
+	}
+
+	if (sb->s_root) {
+		kfree(fs_info);
+	} else {
+		error = proc_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			deactivate_locked_super(sb);
+			goto error_fs_info;
+		}
+
+		sb->s_flags |= MS_ACTIVE;
+	}
+
+	return dget(sb->s_root);
+
+error_fs_info:
+	kfree(fs_info);
+	return ERR_PTR(error);
 }
 
 static void proc_kill_sb(struct super_block *sb)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *ns = (struct pid_namespace *)fs_info->pid_ns;
 
-	ns = (struct pid_namespace *)sb->s_fs_info;
 	if (ns->proc_self)
 		dput(ns->proc_self);
 	if (ns->proc_thread_self)
 		dput(ns->proc_thread_self);
 	kill_anon_super(sb);
 	put_pid_ns(ns);
+	kfree(fs_info);
 }
 
 static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 39857f6..9f95174 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -10,7 +10,8 @@ static const char *proc_self_get_link(struct dentry *dentry,
 				      struct inode *inode,
 				      struct delayed_call *done)
 {
-	struct pid_namespace *ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid_t tgid = task_tgid_nr_ns(current, ns);
 	char *name;
 
@@ -34,9 +35,10 @@ static unsigned self_inum;
 int proc_setup_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = s->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	struct dentry *self;
-	
+
 	inode_lock(root_inode);
 	self = d_alloc_name(s->s_root, "self");
 	if (self) {
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 20614b6..13d9aef 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -10,7 +10,8 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 					     struct inode *inode,
 					     struct delayed_call *done)
 {
-	struct pid_namespace *ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid_t tgid = task_tgid_nr_ns(current, ns);
 	pid_t pid = task_pid_nr_ns(current, ns);
 	char *name;
@@ -34,8 +35,9 @@ static unsigned thread_self_inum;
 
 int proc_setup_thread_self(struct super_block *s)
 {
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = s->s_fs_info;
 	struct dentry *thread_self;
 
 	inode_lock(root_inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index b5713fe..d0ae937 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -36,23 +36,23 @@ static unsigned mounts_poll(struct file *file, poll_table *wait)
 	return res;
 }
 
-struct proc_fs_info {
+struct proc_fs_opts {
 	int flag;
 	const char *str;
 };
 
 static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 {
-	static const struct proc_fs_info fs_info[] = {
+	static const struct proc_fs_opts fs_opts[] = {
 		{ MS_SYNCHRONOUS, ",sync" },
 		{ MS_DIRSYNC, ",dirsync" },
 		{ MS_MANDLOCK, ",mand" },
 		{ MS_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
@@ -62,7 +62,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 
 static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
-	static const struct proc_fs_info mnt_info[] = {
+	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
 		{ MNT_NODEV, ",nodev" },
 		{ MNT_NOEXEC, ",noexec" },
@@ -71,9 +71,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_RELATIME, ",relatime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2d2bf59..27a1e85 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -7,10 +7,18 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 
+struct proc_fs_info {
+	struct pid_namespace *pid_ns;
+	int version;
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
 struct proc_dir_entry;
 
 #ifdef CONFIG_PROC_FS
 
+extern struct proc_fs_info *proc_sb(struct super_block *sb);
 extern void proc_root_init(void);
 extern void proc_flush_task(struct task_struct *);
 
@@ -53,6 +61,7 @@ static inline void proc_flush_task(struct task_struct *task)
 {
 }
 
+extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
 static inline struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent,const char *dest) { return NULL;}
 static inline struct proc_dir_entry *proc_mkdir(const char *name,
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH] proc: allow to change proc mount options per mount
@ 2017-03-23 15:59                 ` Djalal Harouni
  0 siblings, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-23 15:59 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Andy Lutomirski, Alexey Gladkov
  Cc: Al Viro, Linux API, kirill-oKw7cIdHH8eLwutG50LtGA,
	Vasiliy Kulikov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Kees Cook, Dongsu Park,
	Djalal Harouni

As per the discussion with Andy, and following what Al Viro suggested
maybe this can work ? the patch is still buggy on top of Linus' tree 093b995e3b

Currently hidepid mount option is propagated to all proc mounts that are
in the same pid namespace. This patch make it possible to have proc
mounts with different options inside the same pid namespace.

Since this may break userspace or code that checks/expects some device
identifiers, this mode is only supported when theo ption "version=2" is
passed.

[Buggy patch]

[tixxdz@fedora-kvm linux]$ sudo mount -t proc none /test
[tixxdz@fedora-kvm linux]$ sudo mount -t proc -o hidepid=2,version=2 none /test2
[tixxdz@fedora-kvm linux]$ ls -l /proc | head -6
total 0
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 1
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 10
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 100
dr-xr-xr-x.  9 gdm     gdm                   0 Mar 23 17:25 1005
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test | head -16
total 0
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 1
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 10
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 100
dr-xr-xr-x.  9 gdm     gdm                   0 Mar 23 17:25 1005
dr-xr-xr-x.  9 root    root                  0 Mar 23 17:25 101
[tixxdz@fedora-kvm linux]$ ls -l /test2 | head -6
total 0
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1182
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1197
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1199
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1222
dr-xr-xr-x.  9 tixxdz tixxdz               0 Mar 23 17:27 1225


Signed-off-by: Djalal Harouni <tixxdz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 fs/locks.c              |   6 +-
 fs/proc/base.c          |  51 +++++++++-----
 fs/proc/generic.c       |   5 ++
 fs/proc/inode.c         |  10 +--
 fs/proc/root.c          | 181 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/proc/self.c          |   8 ++-
 fs/proc/thread_self.c   |   6 +-
 fs/proc_namespace.c     |  14 ++--
 include/linux/proc_fs.h |   9 +++
 9 files changed, 248 insertions(+), 42 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 2681132..dab5058 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2617,7 +2617,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 	unsigned int fl_pid;
 
 	if (fl->fl_nspid) {
-		struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+		struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+		struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 		/* Don't let fl_pid change based on who is reading the file */
 		fl_pid = pid_nr_ns(fl->fl_nspid, proc_pidns);
@@ -2701,7 +2702,8 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(file_inode(f->file)->i_sb);
+	struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c87b6b9..74b389d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -681,13 +681,24 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (pid->hide_pid < hide_pid_min)
+	int hide_pid;
+	kgid_t pid_gid;
+	int version = fs_info->version;
+
+	if (version == 2) {
+		hide_pid = fs_info->hide_pid;
+		pid_gid = fs_info->pid_gid;
+	} else {
+		hide_pid = fs_info->pid_ns->hide_pid;
+		pid_gid = fs_info->pid_ns->pid_gid;
+	}
+	if (hide_pid < hide_pid_min)
 		return true;
-	if (in_group_p(pid->pid_gid))
+	if (in_group_p(pid_gid))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -695,14 +706,15 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-	struct pid_namespace *pid = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 	struct task_struct *task;
 	bool has_perms;
 
 	task = get_proc_task(inode);
 	if (!task)
 		return -ESRCH;
-	has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
 	put_task_struct(task);
 
 	if (!has_perms) {
@@ -730,12 +742,12 @@ static const struct inode_operations proc_def_inode_operations = {
 static int proc_single_show(struct seq_file *m, void *v)
 {
 	struct inode *inode = m->private;
-	struct pid_namespace *ns;
 	struct pid *pid;
 	struct task_struct *task;
 	int ret;
 
-	ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid = proc_pid(inode);
 	task = get_pid_task(pid, PIDTYPE_PID);
 	if (!task)
@@ -1734,7 +1746,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 {
 	struct inode *inode = d_inode(path->dentry);
 	struct task_struct *task;
-	struct pid_namespace *pid = path->dentry->d_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
 
 	generic_fillattr(inode, stat);
 
@@ -1743,7 +1755,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 	stat->gid = GLOBAL_ROOT_GID;
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
-		if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
 			rcu_read_unlock();
 			/*
 			 * This doesn't prevent learning whether PID exists,
@@ -2249,6 +2261,8 @@ static const struct seq_operations proc_timers_seq_ops = {
 static int proc_timers_open(struct inode *inode, struct file *file)
 {
 	struct timers_private *tp;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	tp = __seq_open_private(file, &proc_timers_seq_ops,
 			sizeof(struct timers_private));
@@ -2256,7 +2270,7 @@ static int proc_timers_open(struct inode *inode, struct file *file)
 		return -ENOMEM;
 
 	tp->pid = proc_pid(inode);
-	tp->ns = inode->i_sb->s_fs_info;
+	tp->ns = ns;
 	return 0;
 }
 
@@ -3077,13 +3091,13 @@ struct dentry *proc_pid_lookup(struct inode *dir, struct dentry * dentry, unsign
 	int result = -ENOENT;
 	struct task_struct *task;
 	unsigned tgid;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(dir->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	tgid = name_to_int(&dentry->d_name);
 	if (tgid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
@@ -3147,7 +3161,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
-	struct pid_namespace *ns = file_inode(file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(file_inode(file)->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	loff_t pos = ctx->pos;
 
 	if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
@@ -3174,7 +3189,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		int len;
 
 		cond_resched();
-		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
 		len = snprintf(name, sizeof(name), "%d", iter.tgid);
@@ -3371,7 +3386,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
 	unsigned tid;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(dentry->d_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 
 	if (!leader)
 		goto out_no_task;
@@ -3380,7 +3396,6 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (tid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tid, ns);
 	if (task)
@@ -3482,7 +3497,8 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
 	struct task_struct *task;
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	int tid;
 
 	if (proc_inode_is_dead(inode))
@@ -3494,7 +3510,6 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	/* f_version caches the tgid value that the last readdir call couldn't
 	 * return. lseek aka telldir automagically resets f_version to 0.
 	 */
-	ns = inode->i_sb->s_fs_info;
 	tid = (int)file->f_version;
 	file->f_version = 0;
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..49c8cb9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -28,6 +28,11 @@
 
 static DEFINE_RWLOCK(proc_subdir_lock);
 
+struct proc_fs_info *proc_sb(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
 {
 	if (len < de->namelen)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..4743943 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,7 +103,8 @@ void __init proc_init_inodecache(void)
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct super_block *sb = root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 
 	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
@@ -473,11 +474,12 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 
 int proc_fill_super(struct super_block *s, void *data, int silent)
 {
-	struct pid_namespace *ns = get_pid_ns(s->s_fs_info);
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = get_pid_ns(fs_info->pid_ns);
 	struct inode *root_inode;
 	int ret;
 
-	if (!proc_parse_options(data, ns))
+	if (fs_info->version == 1 && !proc_parse_options(data, ns))
 		return -EINVAL;
 
 	/* User space would break if executables or devices appear on proc */
@@ -495,7 +497,7 @@ int proc_fill_super(struct super_block *s, void *data, int silent)
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
 	if (!root_inode) {
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..d4047ef 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/sched/stat.h>
+#include <linux/slab.h>
 #include <linux/module.h>
 #include <linux/bitops.h>
 #include <linux/user_namespace.h>
@@ -26,12 +27,13 @@
 #include "internal.h"
 
 enum {
-	Opt_gid, Opt_hidepid, Opt_err,
+	Opt_gid, Opt_hidepid, Opt_version ,Opt_err,
 };
 
 static const match_table_t tokens = {
 	{Opt_hidepid, "hidepid=%u"},
 	{Opt_gid, "gid=%u"},
+	{Opt_version, "version=%u"},
 	{Opt_err, NULL},
 };
 
@@ -67,6 +69,8 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 			}
 			pid->hide_pid = option;
 			break;
+		case Opt_version:
+			break;
 		default:
 			pr_err("proc: unrecognized mount option \"%s\" "
 			       "or missing value\n", p);
@@ -77,40 +81,205 @@ int proc_parse_options(char *options, struct pid_namespace *pid)
 	return 1;
 }
 
+struct proc_options {
+	int version;	/* version field auto set to 1 to not break userspace */
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
+int proc_parse_early_options(char *options, void *holder,
+			     struct proc_options *fs_options)
+{
+	char *p, *opts, *orig;
+	substring_t args[MAX_OPT_ARGS];
+	int option, ret = 0;
+
+	if (!options)
+		return 0;
+
+	opts = kstrdup(options, GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	orig = opts;
+
+	while ((p = strsep(&opts, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_version:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (option < 1 || option > 2) {
+				pr_err("proc: version value must be 1 or 2.\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->version = option;
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->pid_gid = make_kgid(current_user_ns(), option);
+			break;
+		case Opt_hidepid:
+			if (match_int(&args[0], &option)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (option < 0 || option > 2) {
+				pr_err("proc: hidepid value must be between 0 and 2.\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			fs_options->hide_pid = option;
+			break;
+		case Opt_err:
+			/*
+			 * pr_err("proc: unrecognized mount option \"%s\" ", p);
+			 * ret = -EINVAL;
+			 * goto out;
+			 */
+		default:
+			break;
+		}
+	}
+
+out:
+	kfree(orig);
+	return ret;
+}
+
+static int proc_test_super(struct super_block *s, void *data)
+{
+	int ret = 0;
+	struct proc_fs_info *p = data;
+	struct proc_fs_info *fs_info = proc_sb(s);
+
+	if (p->version == 1 && p->pid_ns == fs_info->pid_ns)
+		ret = 1;
+	/*
+	if (p->version == 2 && p->pid_ns == fs_info->pid_ns &&
+	    p->hide_pid == fs_info->hide_pid &&
+	    gid_eq(p->pid_gid, fs_info->pid_gid))
+		ret = 1;
+	*/
+	return ret;
+}
+
+static int proc_set_super(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
 int proc_remount(struct super_block *sb, int *flags, char *data)
 {
-	struct pid_namespace *pid = sb->s_fs_info;
+	int error;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
+	struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };
 
 	sync_filesystem(sb);
-	return !proc_parse_options(data, pid);
+	error = proc_parse_early_options(data, sb->s_type, &fs_options);
+	if (error < 0)
+		return error;
+
+	if (fs_options.version == 1) {
+		error = proc_parse_options(data, pid);
+		if (!error)
+			return -EINVAL;
+	}
+
+	fs_info->version = fs_options.version;
+	fs_info->pid_gid = fs_options.pid_gid;
+	fs_info->hide_pid = fs_options.hide_pid;
+
+	return 0;
 }
 
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
 	struct pid_namespace *ns;
+	struct super_block *sb;
+	struct proc_fs_info *fs_info = NULL;
+	struct proc_options fs_options = { 1, GLOBAL_ROOT_GID, 0 };
+	int error = 0;
+
+	if (!(flags & MS_KERNMOUNT)) {
+		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+			return ERR_PTR(-EPERM);
+
+
+		error = proc_parse_early_options(data, fs_type, &fs_options);
+		if (error)
+			return ERR_PTR(error);
+	}
+
+	fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_NOFS);
+	if (!fs_info)
+		return ERR_PTR(-ENOMEM);
+
+	fs_info->version = fs_options.version;
+	fs_info->pid_gid = fs_options.pid_gid;
+	fs_info->hide_pid = fs_options.hide_pid;
 
 	if (flags & MS_KERNMOUNT) {
 		ns = data;
 		data = NULL;
+		fs_info->version = 1; /* Lets restore this */
 	} else {
 		ns = task_active_pid_ns(current);
 	}
 
-	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
+	fs_info->pid_ns = ns;
+
+	sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+			 ns->user_ns, fs_info);
+	if (IS_ERR(sb)) {
+		error = PTR_ERR(sb);
+		goto error_fs_info;
+	}
+
+	if (sb->s_root) {
+		kfree(fs_info);
+	} else {
+		error = proc_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			deactivate_locked_super(sb);
+			goto error_fs_info;
+		}
+
+		sb->s_flags |= MS_ACTIVE;
+	}
+
+	return dget(sb->s_root);
+
+error_fs_info:
+	kfree(fs_info);
+	return ERR_PTR(error);
 }
 
 static void proc_kill_sb(struct super_block *sb)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb(sb);
+	struct pid_namespace *ns = (struct pid_namespace *)fs_info->pid_ns;
 
-	ns = (struct pid_namespace *)sb->s_fs_info;
 	if (ns->proc_self)
 		dput(ns->proc_self);
 	if (ns->proc_thread_self)
 		dput(ns->proc_thread_self);
 	kill_anon_super(sb);
 	put_pid_ns(ns);
+	kfree(fs_info);
 }
 
 static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 39857f6..9f95174 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -10,7 +10,8 @@ static const char *proc_self_get_link(struct dentry *dentry,
 				      struct inode *inode,
 				      struct delayed_call *done)
 {
-	struct pid_namespace *ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid_t tgid = task_tgid_nr_ns(current, ns);
 	char *name;
 
@@ -34,9 +35,10 @@ static unsigned self_inum;
 int proc_setup_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = s->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	struct dentry *self;
-	
+
 	inode_lock(root_inode);
 	self = d_alloc_name(s->s_root, "self");
 	if (self) {
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 20614b6..13d9aef 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -10,7 +10,8 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 					     struct inode *inode,
 					     struct delayed_call *done)
 {
-	struct pid_namespace *ns = inode->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb(inode->i_sb);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	pid_t tgid = task_tgid_nr_ns(current, ns);
 	pid_t pid = task_pid_nr_ns(current, ns);
 	char *name;
@@ -34,8 +35,9 @@ static unsigned thread_self_inum;
 
 int proc_setup_thread_self(struct super_block *s)
 {
+	struct proc_fs_info *fs_info = proc_sb(s);
+	struct pid_namespace *ns = fs_info->pid_ns;
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = s->s_fs_info;
 	struct dentry *thread_self;
 
 	inode_lock(root_inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index b5713fe..d0ae937 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -36,23 +36,23 @@ static unsigned mounts_poll(struct file *file, poll_table *wait)
 	return res;
 }
 
-struct proc_fs_info {
+struct proc_fs_opts {
 	int flag;
 	const char *str;
 };
 
 static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 {
-	static const struct proc_fs_info fs_info[] = {
+	static const struct proc_fs_opts fs_opts[] = {
 		{ MS_SYNCHRONOUS, ",sync" },
 		{ MS_DIRSYNC, ",dirsync" },
 		{ MS_MANDLOCK, ",mand" },
 		{ MS_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
@@ -62,7 +62,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 
 static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
-	static const struct proc_fs_info mnt_info[] = {
+	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
 		{ MNT_NODEV, ",nodev" },
 		{ MNT_NOEXEC, ",noexec" },
@@ -71,9 +71,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_RELATIME, ",relatime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2d2bf59..27a1e85 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -7,10 +7,18 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 
+struct proc_fs_info {
+	struct pid_namespace *pid_ns;
+	int version;
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
 struct proc_dir_entry;
 
 #ifdef CONFIG_PROC_FS
 
+extern struct proc_fs_info *proc_sb(struct super_block *sb);
 extern void proc_root_init(void);
 extern void proc_flush_task(struct task_struct *);
 
@@ -53,6 +61,7 @@ static inline void proc_flush_task(struct task_struct *task)
 {
 }
 
+extern inline struct proc_fs_info *proc_sb(struct super_block *sb) { return NULL;}
 static inline struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent,const char *dest) { return NULL;}
 static inline struct proc_dir_entry *proc_mkdir(const char *name,
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-20 12:58         ` [RFC] Add option to mount only a pids subset Alexey Gladkov
@ 2017-03-23 16:05             ` Oleg Nesterov
  2017-03-23 16:06             ` Djalal Harouni
  1 sibling, 0 replies; 53+ messages in thread
From: Oleg Nesterov @ 2017-03-23 16:05 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

Again, I can't really review this, I know nothing about vfs, but since
nobody else replied...

On 03/20, Alexey Gladkov wrote:
>
> @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
>  		ns = task_active_pid_ns(current);
>  	}
>
> -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +	root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +
> +	if (!IS_ERR(root)) {
> +		if (!proc_fill_options(data, &opts))
> +			return ERR_PTR(-EINVAL);

So we have to call proc_fill_options() twice, not good... Yes, I understand
why, but perhaps we factor it out somehow, we can pack options + pid_ns into
sb->s_fs_info. Nevermind, this is minor.

> +		if (opts.pid_only) {
> +			int ret;
> +
> +			if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> +				return ERR_PTR(ret);
> +
> +			root = ns->pidfs;

Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
But this all is fixeable.

So with this change "mount -opidonly" creates another IS_ROOT() dentry which
is not equal to sb->s_root. I simply do not know if this is technically
correct or not... but, say, the "Only bind mounts can have disconnected paths"
comment in path_connected() makes me worry ;)

And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
the normal /proc mount. Not really good imo even if not really wrong... Lets
look at proc_flush_task(). The exiting task will flush its $pid dentries in
/proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
still...

Oleg.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-23 16:05             ` Oleg Nesterov
  0 siblings, 0 replies; 53+ messages in thread
From: Oleg Nesterov @ 2017-03-23 16:05 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

Again, I can't really review this, I know nothing about vfs, but since
nobody else replied...

On 03/20, Alexey Gladkov wrote:
>
> @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
>  		ns = task_active_pid_ns(current);
>  	}
>
> -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +	root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +
> +	if (!IS_ERR(root)) {
> +		if (!proc_fill_options(data, &opts))
> +			return ERR_PTR(-EINVAL);

So we have to call proc_fill_options() twice, not good... Yes, I understand
why, but perhaps we factor it out somehow, we can pack options + pid_ns into
sb->s_fs_info. Nevermind, this is minor.

> +		if (opts.pid_only) {
> +			int ret;
> +
> +			if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> +				return ERR_PTR(ret);
> +
> +			root = ns->pidfs;

Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
But this all is fixeable.

So with this change "mount -opidonly" creates another IS_ROOT() dentry which
is not equal to sb->s_root. I simply do not know if this is technically
correct or not... but, say, the "Only bind mounts can have disconnected paths"
comment in path_connected() makes me worry ;)

And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
the normal /proc mount. Not really good imo even if not really wrong... Lets
look at proc_flush_task(). The exiting task will flush its $pid dentries in
/proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
still...

Oleg.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-20 12:58         ` [RFC] Add option to mount only a pids subset Alexey Gladkov
@ 2017-03-23 16:06             ` Djalal Harouni
  2017-03-23 16:06             ` Djalal Harouni
  1 sibling, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-23 16:06 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin

Hi Alexey,

On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
<gladkov.alexey@gmail.com> wrote:
>
>
> Al Viro, this patch looks better ?
>
> == Overview ==
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids without
> rest objects. Since this is an option for /proc, flags applied to /proc
> have an effect on this subset of filesystem.

I just sent a patch that also has to deal with proc hidepid here:
https://lkml.org/lkml/2017/3/23/505

I'm not sure if that's the right approach, it is still buggy, however
seems that your patch also stores the mount option inside the
pid_namespace which may get propagated to all mounts inside same pidns
?

I didn't have enough time but maybe if they are related we can work it
out together ?

Thank you!


-- 
tixxdz

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-23 16:06             ` Djalal Harouni
  0 siblings, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-23 16:06 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin

Hi Alexey,

On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
<gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>
> Al Viro, this patch looks better ?
>
> == Overview ==
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids without
> rest objects. Since this is an option for /proc, flags applied to /proc
> have an effect on this subset of filesystem.

I just sent a patch that also has to deal with proc hidepid here:
https://lkml.org/lkml/2017/3/23/505

I'm not sure if that's the right approach, it is still buggy, however
seems that your patch also stores the mount option inside the
pid_namespace which may get propagated to all mounts inside same pidns
?

I didn't have enough time but maybe if they are related we can work it
out together ?

Thank you!


-- 
tixxdz

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-23 16:06             ` Djalal Harouni
  (?)
@ 2017-03-23 22:07             ` Alexey Gladkov
  2017-03-26  7:03                 ` Djalal Harouni
  -1 siblings, 1 reply; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-23 22:07 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin

On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
> Hi Alexey,
> 
> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
> <gladkov.alexey@gmail.com> wrote:
> >
> >
> > Al Viro, this patch looks better ?
> >
> > == Overview ==
> >
> > Some of the container virtualization systems are mounted /proc inside
> > the container. This is done in most cases to operate with information
> > about the processes. Knowing that /proc filesystem is not fully
> > virtualized they are mounted on top of dangerous places empty files or
> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
> >
> > The structure of this filesystem is dynamic and any module can create a
> > new object which will not necessarily be virtualized. There are
> > proprietary modules that aren't in the mainline whose work we can not
> > verify.
> >
> > This opens up a potential threat to the system. The developers of the
> > virtualization system can't predict all dangerous places in /proc by
> > definition.
> >
> > A more effective solution would be to mount into the container only what
> > is necessary and ignore the rest.
> >
> > Right now there is the opportunity to pass in the container any port of
> > the /proc filesystem using mount --bind expect the pids.
> >
> > This patch allows to mount only the part of /proc related to pids without
> > rest objects. Since this is an option for /proc, flags applied to /proc
> > have an effect on this subset of filesystem.
> 
> I just sent a patch that also has to deal with proc hidepid here:
> https://lkml.org/lkml/2017/3/23/505

I completely agree with you that it looks wrong when options of one
mountpoint affect the others mountpoints.

> I'm not sure if that's the right approach, it is still buggy, however
> seems that your patch also stores the mount option inside the
> pid_namespace which may get propagated to all mounts inside same pidns
> ?

I don't store 'pidonly' option in my current patch. I mean in:
https://lkml.org/lkml/2017/3/20/363

I parse options twice at first mount of procfs. It happens before
the mounting /proc in userspace.

I know it's bad, but I couldn't find place to store this option.

> I didn't have enough time but maybe if they are related we can work it
> out together ?

I don't have enough experience in kenrel hacking, but I would happily do
my best :)

I also tring to mention it in every patch, as my changes almost completely
useless without the ability to use the overlayfs.

Now if you remove the restriction:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

and mount procfs as lowerdir in overlayfs then you get NULL pointer
dereference at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891

I got it when I tried to do `ls -la /overlay/proc/self/`.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-23 22:57               ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-23 22:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote:
> Again, I can't really review this, I know nothing about vfs, but since
> nobody else replied...

Thanks anyway :)

> On 03/20, Alexey Gladkov wrote:
> >
> > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
> >  		ns = task_active_pid_ns(current);
> >  	}
> >
> > -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +	root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +
> > +	if (!IS_ERR(root)) {
> > +		if (!proc_fill_options(data, &opts))
> > +			return ERR_PTR(-EINVAL);
> 
> So we have to call proc_fill_options() twice, not good... Yes, I understand
> why, but perhaps we factor it out somehow, we can pack options + pid_ns into
> sb->s_fs_info. Nevermind, this is minor.

It happens only once, when we don't have the s_root yet.

> > +		if (opts.pid_only) {
> > +			int ret;
> > +
> > +			if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> > +				return ERR_PTR(ret);
> > +
> > +			root = ns->pidfs;
> 
> Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
> in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
> if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
> But this all is fixeable.
> 
> So with this change "mount -opidonly" creates another IS_ROOT() dentry which
> is not equal to sb->s_root. I simply do not know if this is technically
> correct or not... but, say, the "Only bind mounts can have disconnected paths"
> comment in path_connected() makes me worry ;)
> 
> And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
> the normal /proc mount. Not really good imo even if not really wrong... Lets
> look at proc_flush_task(). The exiting task will flush its $pid dentries in
> /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
> still...

I know that I'm cheater, but I did not start first :)

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-23 22:57               ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-23 22:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Pavel Emelyanov,
	James Bottomley, Dmitry V. Levin

On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote:
> Again, I can't really review this, I know nothing about vfs, but since
> nobody else replied...

Thanks anyway :)

> On 03/20, Alexey Gladkov wrote:
> >
> > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
> >  		ns = task_active_pid_ns(current);
> >  	}
> >
> > -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +	root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +
> > +	if (!IS_ERR(root)) {
> > +		if (!proc_fill_options(data, &opts))
> > +			return ERR_PTR(-EINVAL);
> 
> So we have to call proc_fill_options() twice, not good... Yes, I understand
> why, but perhaps we factor it out somehow, we can pack options + pid_ns into
> sb->s_fs_info. Nevermind, this is minor.

It happens only once, when we don't have the s_root yet.

> > +		if (opts.pid_only) {
> > +			int ret;
> > +
> > +			if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> > +				return ERR_PTR(ret);
> > +
> > +			root = ns->pidfs;
> 
> Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
> in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
> if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
> But this all is fixeable.
> 
> So with this change "mount -opidonly" creates another IS_ROOT() dentry which
> is not equal to sb->s_root. I simply do not know if this is technically
> correct or not... but, say, the "Only bind mounts can have disconnected paths"
> comment in path_connected() makes me worry ;)
> 
> And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
> the normal /proc mount. Not really good imo even if not really wrong... Lets
> look at proc_flush_task(). The exiting task will flush its $pid dentries in
> /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
> still...

I know that I'm cheater, but I did not start first :)

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
  2017-03-23 22:07             ` Alexey Gladkov
@ 2017-03-26  7:03                 ` Djalal Harouni
  0 siblings, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-26  7:03 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin, Jann Horn,
	Kees Cook, Andy Lutomirski, Miklos Szeredi

(Cc'ed Kees and Jann for the procfs stacking issue)


On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov
<gladkov.alexey@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
>> Hi Alexey,
>>
>> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>> <gladkov.alexey@gmail.com> wrote:
>> >
>> >
>> > Al Viro, this patch looks better ?
>> >
>> > == Overview ==
>> >
>> > Some of the container virtualization systems are mounted /proc inside
>> > the container. This is done in most cases to operate with information
>> > about the processes. Knowing that /proc filesystem is not fully
>> > virtualized they are mounted on top of dangerous places empty files or
>> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>> >
>> > The structure of this filesystem is dynamic and any module can create a
>> > new object which will not necessarily be virtualized. There are
>> > proprietary modules that aren't in the mainline whose work we can not
>> > verify.
>> >
>> > This opens up a potential threat to the system. The developers of the
>> > virtualization system can't predict all dangerous places in /proc by
>> > definition.
>> >
>> > A more effective solution would be to mount into the container only what
>> > is necessary and ignore the rest.
>> >
>> > Right now there is the opportunity to pass in the container any port of
>> > the /proc filesystem using mount --bind expect the pids.
>> >
>> > This patch allows to mount only the part of /proc related to pids without
>> > rest objects. Since this is an option for /proc, flags applied to /proc
>> > have an effect on this subset of filesystem.
>>
>> I just sent a patch that also has to deal with proc hidepid here:
>> https://lkml.org/lkml/2017/3/23/505
>
> I completely agree with you that it looks wrong when options of one
> mountpoint affect the others mountpoints.
>
>> I'm not sure if that's the right approach, it is still buggy, however
>> seems that your patch also stores the mount option inside the
>> pid_namespace which may get propagated to all mounts inside same pidns
>> ?
>
> I don't store 'pidonly' option in my current patch. I mean in:
> https://lkml.org/lkml/2017/3/20/363
>
> I parse options twice at first mount of procfs. It happens before
> the mounting /proc in userspace.
>
> I know it's bad, but I couldn't find place to store this option.

Ok, then maybe that approach of having a procfs struct holder instead
of using pid namespace may help!


>> I didn't have enough time but maybe if they are related we can work it
>> out together ?
>
> I don't have enough experience in kenrel hacking, but I would happily do
> my best :)

OK, let's sync on this then.

> I also tring to mention it in every patch, as my changes almost completely
> useless without the ability to use the overlayfs.
>
> Now if you remove the restriction:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

This protection was introduced in e54ad7f1ee263

That's a security fix, ecryptfs seems to schedule some code through
kthreads which always run in the initial namespaces with full
capabilities, bypassing all security checks.

I don't know if overlayfs has something similar, if not then maybe if
it's possible in ecryptfs to check the lower mount and the fs type and
move this procfs there...  Cc'ed Jann in case he may comment.

> and mount procfs as lowerdir in overlayfs then you get NULL pointer
> dereference at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891
>
> I got it when I tried to do `ls -la /overlay/proc/self/`.
>
> --
> Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-26  7:03                 ` Djalal Harouni
  0 siblings, 0 replies; 53+ messages in thread
From: Djalal Harouni @ 2017-03-26  7:03 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin, Jann Horn,
	Kees Cook, Andy Lutomirski, Miklos Szeredi

(Cc'ed Kees and Jann for the procfs stacking issue)


On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov
<gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
>> Hi Alexey,
>>
>> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>> <gladkov.alexey-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >
>> >
>> > Al Viro, this patch looks better ?
>> >
>> > == Overview ==
>> >
>> > Some of the container virtualization systems are mounted /proc inside
>> > the container. This is done in most cases to operate with information
>> > about the processes. Knowing that /proc filesystem is not fully
>> > virtualized they are mounted on top of dangerous places empty files or
>> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>> >
>> > The structure of this filesystem is dynamic and any module can create a
>> > new object which will not necessarily be virtualized. There are
>> > proprietary modules that aren't in the mainline whose work we can not
>> > verify.
>> >
>> > This opens up a potential threat to the system. The developers of the
>> > virtualization system can't predict all dangerous places in /proc by
>> > definition.
>> >
>> > A more effective solution would be to mount into the container only what
>> > is necessary and ignore the rest.
>> >
>> > Right now there is the opportunity to pass in the container any port of
>> > the /proc filesystem using mount --bind expect the pids.
>> >
>> > This patch allows to mount only the part of /proc related to pids without
>> > rest objects. Since this is an option for /proc, flags applied to /proc
>> > have an effect on this subset of filesystem.
>>
>> I just sent a patch that also has to deal with proc hidepid here:
>> https://lkml.org/lkml/2017/3/23/505
>
> I completely agree with you that it looks wrong when options of one
> mountpoint affect the others mountpoints.
>
>> I'm not sure if that's the right approach, it is still buggy, however
>> seems that your patch also stores the mount option inside the
>> pid_namespace which may get propagated to all mounts inside same pidns
>> ?
>
> I don't store 'pidonly' option in my current patch. I mean in:
> https://lkml.org/lkml/2017/3/20/363
>
> I parse options twice at first mount of procfs. It happens before
> the mounting /proc in userspace.
>
> I know it's bad, but I couldn't find place to store this option.

Ok, then maybe that approach of having a procfs struct holder instead
of using pid namespace may help!


>> I didn't have enough time but maybe if they are related we can work it
>> out together ?
>
> I don't have enough experience in kenrel hacking, but I would happily do
> my best :)

OK, let's sync on this then.

> I also tring to mention it in every patch, as my changes almost completely
> useless without the ability to use the overlayfs.
>
> Now if you remove the restriction:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

This protection was introduced in e54ad7f1ee263

That's a security fix, ecryptfs seems to schedule some code through
kthreads which always run in the initial namespaces with full
capabilities, bypassing all security checks.

I don't know if overlayfs has something similar, if not then maybe if
it's possible in ecryptfs to check the lower mount and the fs type and
move this procfs there...  Cc'ed Jann in case he may comment.

> and mount procfs as lowerdir in overlayfs then you get NULL pointer
> dereference at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891
>
> I got it when I tried to do `ls -la /overlay/proc/self/`.
>
> --
> Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-30 21:45                   ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-30 21:45 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin, Jann Horn,
	Kees Cook, Andy Lutomirski, Miklos Szeredi

On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote:
> (Cc'ed Kees and Jann for the procfs stacking issue)
> 
> > I completely agree with you that it looks wrong when options of one
> > mountpoint affect the others mountpoints.
> >
> >> I'm not sure if that's the right approach, it is still buggy, however
> >> seems that your patch also stores the mount option inside the
> >> pid_namespace which may get propagated to all mounts inside same pidns
> >> ?
> >
> > I don't store 'pidonly' option in my current patch. I mean in:
> > https://lkml.org/lkml/2017/3/20/363
> >
> > I parse options twice at first mount of procfs. It happens before
> > the mounting /proc in userspace.
> >
> > I know it's bad, but I couldn't find place to store this option.
> 
> Ok, then maybe that approach of having a procfs struct holder instead
> of using pid namespace may help!

I deside to stop doing my patch. I talked with a few people and found out
that the overlayfs doesn't feel very well if on the lower level filesystem
appear/disappear files.

In addition with the pidfs isn't so simple. Separate the root will lead to
a doubling of memory consumption and restrictions on the filesystem operations
level can easily be skipped.

It means that even I can do this pidfs (or pid subset in /proc), it
will be pointless.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC] Add option to mount only a pids subset
@ 2017-03-30 21:45                   ` Alexey Gladkov
  0 siblings, 0 replies; 53+ messages in thread
From: Alexey Gladkov @ 2017-03-30 21:45 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Linux Kernel Mailing List, Linux API, Kirill A. Shutemov,
	Vasiliy Kulikov, Al Viro, Eric W. Biederman, Oleg Nesterov,
	Pavel Emelyanov, James Bottomley, Dmitry V. Levin, Jann Horn,
	Kees Cook, Andy Lutomirski, Miklos Szeredi

On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote:
> (Cc'ed Kees and Jann for the procfs stacking issue)
> 
> > I completely agree with you that it looks wrong when options of one
> > mountpoint affect the others mountpoints.
> >
> >> I'm not sure if that's the right approach, it is still buggy, however
> >> seems that your patch also stores the mount option inside the
> >> pid_namespace which may get propagated to all mounts inside same pidns
> >> ?
> >
> > I don't store 'pidonly' option in my current patch. I mean in:
> > https://lkml.org/lkml/2017/3/20/363
> >
> > I parse options twice at first mount of procfs. It happens before
> > the mounting /proc in userspace.
> >
> > I know it's bad, but I couldn't find place to store this option.
> 
> Ok, then maybe that approach of having a procfs struct holder instead
> of using pid namespace may help!

I deside to stop doing my patch. I talked with a few people and found out
that the overlayfs doesn't feel very well if on the lower level filesystem
appear/disappear files.

In addition with the pidfs isn't so simple. Separate the root will lead to
a doubling of memory consumption and restrictions on the filesystem operations
level can easily be skipped.

It means that even I can do this pidfs (or pid subset in /proc), it
will be pointless.

-- 
Rgrds, legion

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-03-30 21:45 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-18 22:53 [PATCH] Add pidfs filesystem Alexey Gladkov
2017-02-18 23:34 ` kbuild test robot
2017-02-18 23:34 ` kbuild test robot
2017-02-20  4:05 ` Eric W. Biederman
2017-02-20 10:36   ` Alexey Gladkov
2017-02-22 20:11   ` Richard Weinberger
2017-02-21 14:57 ` Oleg Nesterov
2017-02-22  7:40   ` Pavel Emelyanov
2017-02-22 12:04     ` Alexey Gladkov
2017-02-22 13:08       ` Pavel Emelyanov
2017-02-22 11:53   ` Alexey Gladkov
2017-02-22 15:37   ` Dmitry V. Levin
2017-02-22 17:48     ` Oleg Nesterov
2017-02-22 19:56       ` Alexey Gladkov
2017-03-06 23:05   ` [RFC] Add option to mount only a pids subset Alexey Gladkov
2017-03-06 23:05     ` Alexey Gladkov
2017-03-07 16:24     ` Andy Lutomirski
2017-03-07 16:24       ` Andy Lutomirski
2017-03-09 11:26       ` Djalal Harouni
2017-03-09 20:52         ` Eric W. Biederman
2017-03-09 20:52           ` Eric W. Biederman
2017-03-11 21:51         ` Alexey Gladkov
2017-03-11 21:51           ` Alexey Gladkov
2017-03-11  0:05       ` Alexey Gladkov
2017-03-11  0:05         ` Alexey Gladkov
2017-03-07 17:49     ` Oleg Nesterov
2017-03-10 23:46       ` Alexey Gladkov
2017-03-10 23:46         ` Alexey Gladkov
2017-03-12  1:54     ` Al Viro
2017-03-12  1:54       ` Al Viro
2017-03-12  2:13       ` Al Viro
2017-03-12  2:13         ` Al Viro
2017-03-13  3:19         ` Andy Lutomirski
2017-03-13  3:19           ` Andy Lutomirski
2017-03-13 13:27           ` Al Viro
2017-03-13 13:27             ` Al Viro
2017-03-13 15:24             ` Andy Lutomirski
2017-03-13 15:24               ` Andy Lutomirski
2017-03-23 15:59               ` [PATCH] proc: allow to change proc mount options per mount Djalal Harouni
2017-03-23 15:59                 ` Djalal Harouni
2017-03-20 12:58         ` [RFC] Add option to mount only a pids subset Alexey Gladkov
2017-03-23 16:05           ` Oleg Nesterov
2017-03-23 16:05             ` Oleg Nesterov
2017-03-23 22:57             ` Alexey Gladkov
2017-03-23 22:57               ` Alexey Gladkov
2017-03-23 16:06           ` Djalal Harouni
2017-03-23 16:06             ` Djalal Harouni
2017-03-23 22:07             ` Alexey Gladkov
2017-03-26  7:03               ` Djalal Harouni
2017-03-26  7:03                 ` Djalal Harouni
2017-03-30 21:45                 ` Alexey Gladkov
2017-03-30 21:45                   ` Alexey Gladkov
2017-02-27 18:56 ` [PATCH] Add pidfs filesystem Michael Kerrisk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.