All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Make core_pattern support namespace
@ 2016-08-02  9:08 ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

This patchset includes following function points:
1: Let pipe_type core_pattern write dump into container's rootfs
   done by: [PATCH v2 1/2] Limit dump_pipe program's permission to
   init for container
2: Make separate core_pattern setting for each container
   done by: [PATCH v2 2/2] Make core_pattern support namespace
3: Compatibility with current system
   also included in: [PATCH v2 2/2] Make core_pattern support namespace
   If container hadn't change core_pattern setting, it will keep
   same setting with host.

Test:
1: Pass a test script for each function of this patchset
   # ./test_host
   Set file core_pattern: OK
   ./test_host: line 41:  4237 Segmentation fault      (core dumped) "$SCRIPT_BASE_DIR"/make_dump
   Checking dumpfile: OK
   Set file core_pattern: OK
   ./test_host: line 41:  4240 Segmentation fault      (core dumped) "$SCRIPT_BASE_DIR"/make_dump
   Checking dump_pipe triggered: OK
   Checking rootfs: OK
   Checking dumpfile: OK
   Checking namespace: OK
   Checking process list: OK
   Checking capabilities: OK
   #
   # /test
   Segmentation fault (core dumped)
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_cap_init
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_cap_self
   -rw-r--r--    1 root     root            5 Aug  2 09:01 /tmp/dumptest_cg_cgpid
   -rw-r--r--    1 root     root       327680 Aug  2 09:01 /tmp/dumptest_guestdump_11_524288000_16_0_0_1470128516
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_ns_init
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_ns_self
   -rw-r--r--    1 root     root          288 Aug  2 09:01 /tmp/dumptest_psresult
   -rw-r--r--    1 root     root         1728 Aug  2 09:01 /tmp/dumptest_rootfs
   Checking dump_pipe triggered: OK
   Checking rootfs: OK
   Checking dumpfile: OK
   Checking namespace: OK
   Checking process list: OK
   Checking cg pids: OK
   Checking capabilities: OK
2: Pass other test(which is not easy to do in script) by hand.

Changelog v1(RFC)->v2:
1: Add [PATCH 2/2] which was todo in [RFC v1].
2: Pass a test script for each function.
3: Rebase on top of v4.7.

Suggested-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

Zhao Lei (2):
  Limit dump_pipe program's permission to init for container
  Make core_pattern support namespace

 fs/coredump.c                 | 110 +++++++++++++++++++++++++++++++++++++++---
 include/linux/binfmts.h       |   1 +
 include/linux/pid_namespace.h |   3 ++
 kernel/pid.c                  |   2 +
 kernel/pid_namespace.c        |   2 +
 kernel/sysctl.c               |  50 +++++++++++++++++--
 6 files changed, 156 insertions(+), 12 deletions(-)

-- 
1.8.5.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 0/2] Make core_pattern support namespace
@ 2016-08-02  9:08 ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Eric W. Biederman, Mateusz Guzik, Kamezawa Hiroyuki,
	Stéphane Graber, Zhao Lei

This patchset includes following function points:
1: Let pipe_type core_pattern write dump into container's rootfs
   done by: [PATCH v2 1/2] Limit dump_pipe program's permission to
   init for container
2: Make separate core_pattern setting for each container
   done by: [PATCH v2 2/2] Make core_pattern support namespace
3: Compatibility with current system
   also included in: [PATCH v2 2/2] Make core_pattern support namespace
   If container hadn't change core_pattern setting, it will keep
   same setting with host.

Test:
1: Pass a test script for each function of this patchset
   # ./test_host
   Set file core_pattern: OK
   ./test_host: line 41:  4237 Segmentation fault      (core dumped) "$SCRIPT_BASE_DIR"/make_dump
   Checking dumpfile: OK
   Set file core_pattern: OK
   ./test_host: line 41:  4240 Segmentation fault      (core dumped) "$SCRIPT_BASE_DIR"/make_dump
   Checking dump_pipe triggered: OK
   Checking rootfs: OK
   Checking dumpfile: OK
   Checking namespace: OK
   Checking process list: OK
   Checking capabilities: OK
   #
   # /test
   Segmentation fault (core dumped)
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_cap_init
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_cap_self
   -rw-r--r--    1 root     root            5 Aug  2 09:01 /tmp/dumptest_cg_cgpid
   -rw-r--r--    1 root     root       327680 Aug  2 09:01 /tmp/dumptest_guestdump_11_524288000_16_0_0_1470128516
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_ns_init
   -rw-r--r--    1 root     root          123 Aug  2 09:01 /tmp/dumptest_ns_self
   -rw-r--r--    1 root     root          288 Aug  2 09:01 /tmp/dumptest_psresult
   -rw-r--r--    1 root     root         1728 Aug  2 09:01 /tmp/dumptest_rootfs
   Checking dump_pipe triggered: OK
   Checking rootfs: OK
   Checking dumpfile: OK
   Checking namespace: OK
   Checking process list: OK
   Checking cg pids: OK
   Checking capabilities: OK
2: Pass other test(which is not easy to do in script) by hand.

Changelog v1(RFC)->v2:
1: Add [PATCH 2/2] which was todo in [RFC v1].
2: Pass a test script for each function.
3: Rebase on top of v4.7.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>

Zhao Lei (2):
  Limit dump_pipe program's permission to init for container
  Make core_pattern support namespace

 fs/coredump.c                 | 110 +++++++++++++++++++++++++++++++++++++++---
 include/linux/binfmts.h       |   1 +
 include/linux/pid_namespace.h |   3 ++
 kernel/pid.c                  |   2 +
 kernel/pid_namespace.c        |   2 +
 kernel/sysctl.c               |  50 +++++++++++++++++--
 6 files changed, 156 insertions(+), 12 deletions(-)

-- 
1.8.5.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
  2016-08-02  9:08 ` Zhao Lei
@ 2016-08-02  9:08     ` Zhao Lei
  -1 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Currently when we set core_pattern to a pipe, the pipe program is
forked by kthread running with root's permission, and write dumpfile
into host's filesystem.
Same thing happened for container, the dumper and dumpfile are also
in host(not in container).

It have following program:
1: Not consistent with file_type core_pattern
   When we set core_pattern to a file, the container will write dump
   into container's filesystem instead of host.
2: Not safe for privileged container
   In a privileged container, user can destroy host system by following
   command:
   # # In a container
   # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
   # make_dump

This patch switch dumper program's environment to init task, so, for
container, dumper program have same environment with init task in
container, which make dumper program put in container's filesystem, and
write coredump into container's filesystem.
The dumper's permission is also limited into subset of container's init
process.

Suggested-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 fs/coredump.c           | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/binfmts.h |  1 +
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 281b768..8511267 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 {
 	struct file *files[2];
 	struct coredump_params *cp = (struct coredump_params *)info->data;
+	struct task_struct *base_task;
+
 	int err = create_pipe_files(files, 0);
 	if (err)
 		return err;
@@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 
 	err = replace_fd(0, files[0], 0);
 	fput(files[0]);
+	if (err)
+		return err;
+
 	/* and disallow core files too */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
 
-	return err;
+	base_task = cp->base_task;
+	if (base_task) {
+		const struct cred *base_cred;
+
+		/* Set fs_root to base_task */
+		spin_lock(&base_task->fs->lock);
+		set_fs_root(current->fs, &base_task->fs->root);
+		spin_unlock(&base_task->fs->lock);
+
+		/* Set namespaces to base_task */
+		get_nsproxy(base_task->nsproxy);
+		switch_task_namespaces(current, base_task->nsproxy);
+
+		/* Set cgroup to base_task */
+		current->flags &= ~PF_NO_SETAFFINITY;
+		err = cgroup_attach_task_all(base_task, current);
+		if (err < 0)
+			return err;
+
+		/* Set cred to base_task */
+		base_cred = get_task_cred(base_task);
+
+		new->uid   = base_cred->uid;
+		new->gid   = base_cred->gid;
+		new->suid  = base_cred->suid;
+		new->sgid  = base_cred->sgid;
+		new->euid  = base_cred->euid;
+		new->egid  = base_cred->egid;
+		new->fsuid = base_cred->fsuid;
+		new->fsgid = base_cred->fsgid;
+
+		new->securebits = base_cred->securebits;
+
+		new->cap_inheritable = base_cred->cap_inheritable;
+		new->cap_permitted   = base_cred->cap_permitted;
+		new->cap_effective   = base_cred->cap_effective;
+		new->cap_bset        = base_cred->cap_bset;
+		new->cap_ambient     = base_cred->cap_ambient;
+
+		security_cred_free(new);
+#ifdef CONFIG_SECURITY
+		new->security = NULL;
+#endif
+		err = security_prepare_creds(new, base_cred, GFP_KERNEL);
+		if (err < 0) {
+			put_cred(base_cred);
+			return err;
+		}
+
+		free_uid(new->user);
+		new->user = base_cred->user;
+		get_uid(new->user);
+
+		put_user_ns(new->user_ns);
+		new->user_ns = base_cred->user_ns;
+		get_user_ns(new->user_ns);
+
+		put_group_info(new->group_info);
+		new->group_info = base_cred->group_info;
+		get_group_info(new->group_info);
+
+		put_cred(base_cred);
+
+		validate_creds(new);
+	}
+
+	return 0;
 }
 
 void do_coredump(const siginfo_t *siginfo)
@@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
 
 	if (ispipe) {
 		int dump_count;
+		struct task_struct *vinit_task;
 		char **helper_argv;
 		struct subprocess_info *sub_info;
 
@@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
+		rcu_read_lock();
+		vinit_task = find_task_by_vpid(1);
+		rcu_read_unlock();
+		if (!vinit_task) {
+			printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
+			goto fail_dropcount;
+		}
+
 		helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
 		if (!helper_argv) {
 			printk(KERN_WARNING "%s failed to allocate memory\n",
@@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
+		get_task_struct(vinit_task);
+
+		cprm.base_task = vinit_task;
+
 		retval = -ENOMEM;
 		sub_info = call_usermodehelper_setup(helper_argv[0],
 						helper_argv, NULL, GFP_KERNEL,
@@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
 			retval = call_usermodehelper_exec(sub_info,
 							  UMH_WAIT_EXEC);
 
+		put_task_struct(vinit_task);
 		argv_free(helper_argv);
 		if (retval) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 314b3ca..0c9a72c 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -59,6 +59,7 @@ struct linux_binprm {
 
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
+	struct task_struct *base_task;
 	const siginfo_t *siginfo;
 	struct pt_regs *regs;
 	struct file *file;
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
@ 2016-08-02  9:08     ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Eric W. Biederman, Mateusz Guzik, Kamezawa Hiroyuki,
	Stéphane Graber, Zhao Lei

Currently when we set core_pattern to a pipe, the pipe program is
forked by kthread running with root's permission, and write dumpfile
into host's filesystem.
Same thing happened for container, the dumper and dumpfile are also
in host(not in container).

It have following program:
1: Not consistent with file_type core_pattern
   When we set core_pattern to a file, the container will write dump
   into container's filesystem instead of host.
2: Not safe for privileged container
   In a privileged container, user can destroy host system by following
   command:
   # # In a container
   # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
   # make_dump

This patch switch dumper program's environment to init task, so, for
container, dumper program have same environment with init task in
container, which make dumper program put in container's filesystem, and
write coredump into container's filesystem.
The dumper's permission is also limited into subset of container's init
process.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/coredump.c           | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/binfmts.h |  1 +
 2 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 281b768..8511267 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 {
 	struct file *files[2];
 	struct coredump_params *cp = (struct coredump_params *)info->data;
+	struct task_struct *base_task;
+
 	int err = create_pipe_files(files, 0);
 	if (err)
 		return err;
@@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 
 	err = replace_fd(0, files[0], 0);
 	fput(files[0]);
+	if (err)
+		return err;
+
 	/* and disallow core files too */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
 
-	return err;
+	base_task = cp->base_task;
+	if (base_task) {
+		const struct cred *base_cred;
+
+		/* Set fs_root to base_task */
+		spin_lock(&base_task->fs->lock);
+		set_fs_root(current->fs, &base_task->fs->root);
+		spin_unlock(&base_task->fs->lock);
+
+		/* Set namespaces to base_task */
+		get_nsproxy(base_task->nsproxy);
+		switch_task_namespaces(current, base_task->nsproxy);
+
+		/* Set cgroup to base_task */
+		current->flags &= ~PF_NO_SETAFFINITY;
+		err = cgroup_attach_task_all(base_task, current);
+		if (err < 0)
+			return err;
+
+		/* Set cred to base_task */
+		base_cred = get_task_cred(base_task);
+
+		new->uid   = base_cred->uid;
+		new->gid   = base_cred->gid;
+		new->suid  = base_cred->suid;
+		new->sgid  = base_cred->sgid;
+		new->euid  = base_cred->euid;
+		new->egid  = base_cred->egid;
+		new->fsuid = base_cred->fsuid;
+		new->fsgid = base_cred->fsgid;
+
+		new->securebits = base_cred->securebits;
+
+		new->cap_inheritable = base_cred->cap_inheritable;
+		new->cap_permitted   = base_cred->cap_permitted;
+		new->cap_effective   = base_cred->cap_effective;
+		new->cap_bset        = base_cred->cap_bset;
+		new->cap_ambient     = base_cred->cap_ambient;
+
+		security_cred_free(new);
+#ifdef CONFIG_SECURITY
+		new->security = NULL;
+#endif
+		err = security_prepare_creds(new, base_cred, GFP_KERNEL);
+		if (err < 0) {
+			put_cred(base_cred);
+			return err;
+		}
+
+		free_uid(new->user);
+		new->user = base_cred->user;
+		get_uid(new->user);
+
+		put_user_ns(new->user_ns);
+		new->user_ns = base_cred->user_ns;
+		get_user_ns(new->user_ns);
+
+		put_group_info(new->group_info);
+		new->group_info = base_cred->group_info;
+		get_group_info(new->group_info);
+
+		put_cred(base_cred);
+
+		validate_creds(new);
+	}
+
+	return 0;
 }
 
 void do_coredump(const siginfo_t *siginfo)
@@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
 
 	if (ispipe) {
 		int dump_count;
+		struct task_struct *vinit_task;
 		char **helper_argv;
 		struct subprocess_info *sub_info;
 
@@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
+		rcu_read_lock();
+		vinit_task = find_task_by_vpid(1);
+		rcu_read_unlock();
+		if (!vinit_task) {
+			printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
+			goto fail_dropcount;
+		}
+
 		helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
 		if (!helper_argv) {
 			printk(KERN_WARNING "%s failed to allocate memory\n",
@@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
+		get_task_struct(vinit_task);
+
+		cprm.base_task = vinit_task;
+
 		retval = -ENOMEM;
 		sub_info = call_usermodehelper_setup(helper_argv[0],
 						helper_argv, NULL, GFP_KERNEL,
@@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
 			retval = call_usermodehelper_exec(sub_info,
 							  UMH_WAIT_EXEC);
 
+		put_task_struct(vinit_task);
 		argv_free(helper_argv);
 		if (retval) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 314b3ca..0c9a72c 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -59,6 +59,7 @@ struct linux_binprm {
 
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
+	struct task_struct *base_task;
 	const siginfo_t *siginfo;
 	struct pt_regs *regs;
 	struct file *file;
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/2] Make core_pattern support namespace
  2016-08-02  9:08 ` Zhao Lei
@ 2016-08-02  9:08     ` Zhao Lei
  -1 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Currently, each container shared one copy of coredump setting
with the host system, if host system changed the setting, each
running containers will be affected.
Same story happened when container changed core_pattern, both
host and other container will be affected.

For container based on namespace design, it is good to allow
each container keeping their own coredump setting.

It will bring us following benefit:
1: Each container can change their own coredump setting
   based on operation on /proc/sys/kernel/core_pattern
2: Coredump setting changed in host will not affect
   running containers.
3: Support both case of "putting coredump in guest" and
   "putting curedump in host".

Each namespace-based software(lxc, docker, ..) can use this function
to custom their dump setting.

And this function makes each continer working as separate system,
it fit for design goal of namespace.

Test(in lxc):
 # In the host
 # ----------------
 # echo host_core >/proc/sys/kernel/core_pattern
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ulimit -c 1024000
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:02 host_core.2175
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

 # In the container
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # echo container_core >/proc/sys/kernel/core_pattern
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rwxr-xr-x    1 root     root       759731 Feb  4 10:45 make_dump
 -rw-------    1 root     root       331776 Feb  4 10:45 container_core.16
 #

 # Return to host
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ls
 host_core.2175  make_dump  make_dump.c
 # rm -f host_core.2175
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:49 host_core.2351
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 fs/coredump.c                 | 25 ++++++++++++++++------
 include/linux/pid_namespace.h |  3 +++
 kernel/pid.c                  |  2 ++
 kernel/pid_namespace.c        |  2 ++
 kernel/sysctl.c               | 50 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 8511267..b9430cd 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -49,7 +49,6 @@
 
 int core_uses_pid;
 unsigned int core_pipe_limit;
-char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 
 struct core_name {
@@ -57,8 +56,6 @@ struct core_name {
 	int used, size;
 };
 
-/* The maximal length of core_pattern is also specified in sysctl.c */
-
 static int expand_corename(struct core_name *cn, int size)
 {
 	char *corename = krealloc(cn->corename, size, GFP_KERNEL);
@@ -183,10 +180,10 @@ put_exe_file:
  * name into corename, which must have space for at least
  * CORENAME_MAX_SIZE bytes plus one byte for the zero terminator.
  */
-static int format_corename(struct core_name *cn, struct coredump_params *cprm)
+static int format_corename(struct core_name *cn, const char *pat_ptr,
+			   struct coredump_params *cprm)
 {
 	const struct cred *cred = current_cred();
-	const char *pat_ptr = core_pattern;
 	int ispipe = (*pat_ptr == '|');
 	int pid_in_pattern = 0;
 	int err = 0;
@@ -627,6 +624,8 @@ void do_coredump(const siginfo_t *siginfo)
 		 */
 		.mm_flags = mm->flags,
 	};
+	struct pid_namespace *pid_ns;
+	char core_pattern[CORENAME_MAX_SIZE];
 
 	audit_core_dumps(siginfo->si_signo);
 
@@ -636,6 +635,18 @@ void do_coredump(const siginfo_t *siginfo)
 	if (!__get_dumpable(cprm.mm_flags))
 		goto fail;
 
+	pid_ns = task_active_pid_ns(current);
+	spin_lock(&pid_ns->core_pattern_lock);
+	while (pid_ns != &init_pid_ns) {
+		if (pid_ns->core_pattern[0])
+			break;
+		spin_unlock(&pid_ns->core_pattern_lock);
+		pid_ns = pid_ns->parent,
+		spin_lock(&pid_ns->core_pattern_lock);
+	}
+	strcpy(core_pattern, pid_ns->core_pattern);
+	spin_unlock(&pid_ns->core_pattern_lock);
+
 	cred = prepare_creds();
 	if (!cred)
 		goto fail;
@@ -657,7 +668,7 @@ void do_coredump(const siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm);
+	ispipe = format_corename(&cn, core_pattern, &cprm);
 
 	if (ispipe) {
 		int dump_count;
@@ -704,7 +715,7 @@ void do_coredump(const siginfo_t *siginfo)
 		}
 
 		rcu_read_lock();
-		vinit_task = find_task_by_vpid(1);
+		vinit_task = find_task_by_pid_ns(1, pid_ns);
 		rcu_read_unlock();
 		if (!vinit_task) {
 			printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 918b117..9a6ec4e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/kref.h>
 #include <linux/ns_common.h>
+#include <linux/binfmts.h>
 
 struct pidmap {
        atomic_t nr_free;
@@ -45,6 +46,8 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+	spinlock_t core_pattern_lock;
+	char core_pattern[CORENAME_MAX_SIZE];
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index f66162f..e7ee122 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,8 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+	.core_pattern_lock = __SPIN_LOCK_UNLOCKED(init_pid_ns.core_pattern_lock),
+	.core_pattern = "core",
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a65ba13..b5ce9d1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -123,6 +123,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	spin_lock_init(&ns->core_pattern_lock);
+
 	return ns;
 
 out_free_map:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87b2fc3..ee4b4ff 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -484,7 +484,7 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "core_pattern",
-		.data		= core_pattern,
+		.data		= NULL,
 		.maxlen		= CORENAME_MAX_SIZE,
 		.mode		= 0644,
 		.proc_handler	= proc_dostring_coredump,
@@ -2350,12 +2350,20 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 static void validate_coredump_safety(void)
 {
 #ifdef CONFIG_COREDUMP
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+	const char *core_pattern;
+
+	spin_lock(&pid_ns->core_pattern_lock);
+	core_pattern = pid_ns->core_pattern;
+
 	if (suid_dumpable == SUID_DUMP_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|') {
 		printk(KERN_WARNING "Unsafe core_pattern used with "\
 			"suid_dumpable=2. Pipe handler or fully qualified "\
 			"core dump path required.\n");
 	}
+
+	spin_unlock(&pid_ns->core_pattern_lock);
 #endif
 }
 
@@ -2372,10 +2380,42 @@ static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 static int proc_dostring_coredump(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
-	if (!error)
-		validate_coredump_safety();
-	return error;
+	int ret;
+	char core_pattern[CORENAME_MAX_SIZE];
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+
+	if (write) {
+		if (*ppos && sysctl_writes_strict == SYSCTL_WRITES_WARN)
+			warn_sysctl_write(table);
+
+		ret = _proc_do_string(core_pattern, table->maxlen, write,
+				      (char __user *)buffer, lenp, ppos);
+		if (ret)
+			return ret;
+
+		spin_lock(&pid_ns->core_pattern_lock);
+		strcpy(pid_ns->core_pattern, core_pattern);
+		spin_unlock(&pid_ns->core_pattern_lock);
+	} else {
+		spin_lock(&pid_ns->core_pattern_lock);
+		while (pid_ns != &init_pid_ns) {
+			if (pid_ns->core_pattern[0])
+				break;
+			spin_unlock(&pid_ns->core_pattern_lock);
+			pid_ns = pid_ns->parent,
+			spin_lock(&pid_ns->core_pattern_lock);
+		}
+		strcpy(core_pattern, pid_ns->core_pattern);
+		spin_unlock(&pid_ns->core_pattern_lock);
+
+		ret = _proc_do_string(core_pattern, table->maxlen, write,
+				      (char __user *)buffer, lenp, ppos);
+		if (ret)
+			return ret;
+	}
+
+	validate_coredump_safety();
+	return 0;
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/2] Make core_pattern support namespace
@ 2016-08-02  9:08     ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-02  9:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Eric W. Biederman, Mateusz Guzik, Kamezawa Hiroyuki,
	Stéphane Graber, Zhao Lei

Currently, each container shared one copy of coredump setting
with the host system, if host system changed the setting, each
running containers will be affected.
Same story happened when container changed core_pattern, both
host and other container will be affected.

For container based on namespace design, it is good to allow
each container keeping their own coredump setting.

It will bring us following benefit:
1: Each container can change their own coredump setting
   based on operation on /proc/sys/kernel/core_pattern
2: Coredump setting changed in host will not affect
   running containers.
3: Support both case of "putting coredump in guest" and
   "putting curedump in host".

Each namespace-based software(lxc, docker, ..) can use this function
to custom their dump setting.

And this function makes each continer working as separate system,
it fit for design goal of namespace.

Test(in lxc):
 # In the host
 # ----------------
 # echo host_core >/proc/sys/kernel/core_pattern
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ulimit -c 1024000
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:02 host_core.2175
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

 # In the container
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # echo container_core >/proc/sys/kernel/core_pattern
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rwxr-xr-x    1 root     root       759731 Feb  4 10:45 make_dump
 -rw-------    1 root     root       331776 Feb  4 10:45 container_core.16
 #

 # Return to host
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ls
 host_core.2175  make_dump  make_dump.c
 # rm -f host_core.2175
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:49 host_core.2351
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/coredump.c                 | 25 ++++++++++++++++------
 include/linux/pid_namespace.h |  3 +++
 kernel/pid.c                  |  2 ++
 kernel/pid_namespace.c        |  2 ++
 kernel/sysctl.c               | 50 ++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 8511267..b9430cd 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -49,7 +49,6 @@
 
 int core_uses_pid;
 unsigned int core_pipe_limit;
-char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 
 struct core_name {
@@ -57,8 +56,6 @@ struct core_name {
 	int used, size;
 };
 
-/* The maximal length of core_pattern is also specified in sysctl.c */
-
 static int expand_corename(struct core_name *cn, int size)
 {
 	char *corename = krealloc(cn->corename, size, GFP_KERNEL);
@@ -183,10 +180,10 @@ put_exe_file:
  * name into corename, which must have space for at least
  * CORENAME_MAX_SIZE bytes plus one byte for the zero terminator.
  */
-static int format_corename(struct core_name *cn, struct coredump_params *cprm)
+static int format_corename(struct core_name *cn, const char *pat_ptr,
+			   struct coredump_params *cprm)
 {
 	const struct cred *cred = current_cred();
-	const char *pat_ptr = core_pattern;
 	int ispipe = (*pat_ptr == '|');
 	int pid_in_pattern = 0;
 	int err = 0;
@@ -627,6 +624,8 @@ void do_coredump(const siginfo_t *siginfo)
 		 */
 		.mm_flags = mm->flags,
 	};
+	struct pid_namespace *pid_ns;
+	char core_pattern[CORENAME_MAX_SIZE];
 
 	audit_core_dumps(siginfo->si_signo);
 
@@ -636,6 +635,18 @@ void do_coredump(const siginfo_t *siginfo)
 	if (!__get_dumpable(cprm.mm_flags))
 		goto fail;
 
+	pid_ns = task_active_pid_ns(current);
+	spin_lock(&pid_ns->core_pattern_lock);
+	while (pid_ns != &init_pid_ns) {
+		if (pid_ns->core_pattern[0])
+			break;
+		spin_unlock(&pid_ns->core_pattern_lock);
+		pid_ns = pid_ns->parent,
+		spin_lock(&pid_ns->core_pattern_lock);
+	}
+	strcpy(core_pattern, pid_ns->core_pattern);
+	spin_unlock(&pid_ns->core_pattern_lock);
+
 	cred = prepare_creds();
 	if (!cred)
 		goto fail;
@@ -657,7 +668,7 @@ void do_coredump(const siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm);
+	ispipe = format_corename(&cn, core_pattern, &cprm);
 
 	if (ispipe) {
 		int dump_count;
@@ -704,7 +715,7 @@ void do_coredump(const siginfo_t *siginfo)
 		}
 
 		rcu_read_lock();
-		vinit_task = find_task_by_vpid(1);
+		vinit_task = find_task_by_pid_ns(1, pid_ns);
 		rcu_read_unlock();
 		if (!vinit_task) {
 			printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 918b117..9a6ec4e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/kref.h>
 #include <linux/ns_common.h>
+#include <linux/binfmts.h>
 
 struct pidmap {
        atomic_t nr_free;
@@ -45,6 +46,8 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+	spinlock_t core_pattern_lock;
+	char core_pattern[CORENAME_MAX_SIZE];
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index f66162f..e7ee122 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,8 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+	.core_pattern_lock = __SPIN_LOCK_UNLOCKED(init_pid_ns.core_pattern_lock),
+	.core_pattern = "core",
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a65ba13..b5ce9d1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -123,6 +123,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	spin_lock_init(&ns->core_pattern_lock);
+
 	return ns;
 
 out_free_map:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87b2fc3..ee4b4ff 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -484,7 +484,7 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "core_pattern",
-		.data		= core_pattern,
+		.data		= NULL,
 		.maxlen		= CORENAME_MAX_SIZE,
 		.mode		= 0644,
 		.proc_handler	= proc_dostring_coredump,
@@ -2350,12 +2350,20 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 static void validate_coredump_safety(void)
 {
 #ifdef CONFIG_COREDUMP
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+	const char *core_pattern;
+
+	spin_lock(&pid_ns->core_pattern_lock);
+	core_pattern = pid_ns->core_pattern;
+
 	if (suid_dumpable == SUID_DUMP_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|') {
 		printk(KERN_WARNING "Unsafe core_pattern used with "\
 			"suid_dumpable=2. Pipe handler or fully qualified "\
 			"core dump path required.\n");
 	}
+
+	spin_unlock(&pid_ns->core_pattern_lock);
 #endif
 }
 
@@ -2372,10 +2380,42 @@ static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 static int proc_dostring_coredump(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
-	if (!error)
-		validate_coredump_safety();
-	return error;
+	int ret;
+	char core_pattern[CORENAME_MAX_SIZE];
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+
+	if (write) {
+		if (*ppos && sysctl_writes_strict == SYSCTL_WRITES_WARN)
+			warn_sysctl_write(table);
+
+		ret = _proc_do_string(core_pattern, table->maxlen, write,
+				      (char __user *)buffer, lenp, ppos);
+		if (ret)
+			return ret;
+
+		spin_lock(&pid_ns->core_pattern_lock);
+		strcpy(pid_ns->core_pattern, core_pattern);
+		spin_unlock(&pid_ns->core_pattern_lock);
+	} else {
+		spin_lock(&pid_ns->core_pattern_lock);
+		while (pid_ns != &init_pid_ns) {
+			if (pid_ns->core_pattern[0])
+				break;
+			spin_unlock(&pid_ns->core_pattern_lock);
+			pid_ns = pid_ns->parent,
+			spin_lock(&pid_ns->core_pattern_lock);
+		}
+		strcpy(core_pattern, pid_ns->core_pattern);
+		spin_unlock(&pid_ns->core_pattern_lock);
+
+		ret = _proc_do_string(core_pattern, table->maxlen, write,
+				      (char __user *)buffer, lenp, ppos);
+		if (ret)
+			return ret;
+	}
+
+	validate_coredump_safety();
+	return 0;
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
  2016-08-02  9:08     ` Zhao Lei
@ 2016-08-05  6:32         ` Andrei Vagin
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrei Vagin @ 2016-08-05  6:32 UTC (permalink / raw)
  To: Zhao Lei; +Cc: Linux Containers, LKML, Eric W. Biederman

On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> Currently when we set core_pattern to a pipe, the pipe program is
> forked by kthread running with root's permission, and write dumpfile
> into host's filesystem.
> Same thing happened for container, the dumper and dumpfile are also
> in host(not in container).
>
> It have following program:
> 1: Not consistent with file_type core_pattern
>    When we set core_pattern to a file, the container will write dump
>    into container's filesystem instead of host.
> 2: Not safe for privileged container
>    In a privileged container, user can destroy host system by following
>    command:
>    # # In a container
>    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
>    # make_dump
>
> This patch switch dumper program's environment to init task, so, for
> container, dumper program have same environment with init task in
> container, which make dumper program put in container's filesystem, and
> write coredump into container's filesystem.
> The dumper's permission is also limited into subset of container's init
> process.

Does it change the current behavior? A pid namespace may be used for
sandboxes. For example, chrome uses it. In this case, we probably want
to collect core dumps from a root pid namespace.

If we are going to virtualize core_pattern relative to the pid
namespace, maybe it's better to make it optional for a pid namespace.
I mean it core_pattern is not set for the current pid namespace, we
can check a parent pid namespace and so on. A helper will be executed
in a pid namespace, where core_pattern is set.

After reading these patches, I think it may be a good idea to add one
more mode to handle core files, when a core file is sent to an
existing process instead of executing a usermode helpers. This mode
will cover of use-cases where the pipe mode is used now. And it looks
more secure, because in this case we control namespaces and credential
for the abort daemon (a process which handles core files).u

How it can be done. An abort daemon creates an abstract listening unix
socket and writes its name into /proc/sys/kernel/core_pattern. The
kernel saves this name and the network namespace. Then when any
process is crashed, the kernel creates a new unix socket and connect
it to the socket of the abort daemon and streams a core file into this
socket.

>
> Suggested-by: Eric W. Biederman <ebieyderm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>
> Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  fs/coredump.c           | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/binfmts.h |  1 +
>  2 files changed, 87 insertions(+), 1 deletion(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 281b768..8511267 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>  {
>         struct file *files[2];
>         struct coredump_params *cp = (struct coredump_params *)info->data;
> +       struct task_struct *base_task;
> +
>         int err = create_pipe_files(files, 0);
>         if (err)
>                 return err;
> @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>
>         err = replace_fd(0, files[0], 0);
>         fput(files[0]);
> +       if (err)
> +               return err;
> +
>         /* and disallow core files too */
>         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
>
> -       return err;
> +       base_task = cp->base_task;
> +       if (base_task) {
> +               const struct cred *base_cred;
> +
> +               /* Set fs_root to base_task */
> +               spin_lock(&base_task->fs->lock);
> +               set_fs_root(current->fs, &base_task->fs->root);
> +               spin_unlock(&base_task->fs->lock);
> +
> +               /* Set namespaces to base_task */
> +               get_nsproxy(base_task->nsproxy);
> +               switch_task_namespaces(current, base_task->nsproxy);

This task will continue running in the current pid namespace, because
switch_task_namespaces doesn't change a pid for the task. Ussualy, we
need to make setns+fork to switch a pid namespace.

> +
> +               /* Set cgroup to base_task */
> +               current->flags &= ~PF_NO_SETAFFINITY;
> +               err = cgroup_attach_task_all(base_task, current);
> +               if (err < 0)
> +                       return err;
> +
> +               /* Set cred to base_task */
> +               base_cred = get_task_cred(base_task);
> +

I think you can use prepare_kernel_cred here

> +               new->uid   = base_cred->uid;
> +               new->gid   = base_cred->gid;
> +               new->suid  = base_cred->suid;
> +               new->sgid  = base_cred->sgid;
> +               new->euid  = base_cred->euid;
> +               new->egid  = base_cred->egid;
> +               new->fsuid = base_cred->fsuid;
> +               new->fsgid = base_cred->fsgid;
> +
> +               new->securebits = base_cred->securebits;
> +
> +               new->cap_inheritable = base_cred->cap_inheritable;
> +               new->cap_permitted   = base_cred->cap_permitted;
> +               new->cap_effective   = base_cred->cap_effective;
> +               new->cap_bset        = base_cred->cap_bset;
> +               new->cap_ambient     = base_cred->cap_ambient;
> +
> +               security_cred_free(new);
> +#ifdef CONFIG_SECURITY
> +               new->security = NULL;
> +#endif
> +               err = security_prepare_creds(new, base_cred, GFP_KERNEL);
> +               if (err < 0) {
> +                       put_cred(base_cred);
> +                       return err;
> +               }
> +
> +               free_uid(new->user);
> +               new->user = base_cred->user;
> +               get_uid(new->user);
> +
> +               put_user_ns(new->user_ns);
> +               new->user_ns = base_cred->user_ns;
> +               get_user_ns(new->user_ns);
> +
> +               put_group_info(new->group_info);
> +               new->group_info = base_cred->group_info;
> +               get_group_info(new->group_info);
> +
> +               put_cred(base_cred);
> +
> +               validate_creds(new);
> +       }
> +
> +       return 0;
>  }
>
>  void do_coredump(const siginfo_t *siginfo)
> @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
>
>         if (ispipe) {
>                 int dump_count;
> +               struct task_struct *vinit_task;
>                 char **helper_argv;
>                 struct subprocess_info *sub_info;
>
> @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
>                         goto fail_dropcount;
>                 }
>
> +               rcu_read_lock();
> +               vinit_task = find_task_by_vpid(1);
> +               rcu_read_unlock();
> +               if (!vinit_task) {
> +                       printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
> +                       goto fail_dropcount;
> +               }
> +
>                 helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
>                 if (!helper_argv) {
>                         printk(KERN_WARNING "%s failed to allocate memory\n",
> @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
>                         goto fail_dropcount;
>                 }
>
> +               get_task_struct(vinit_task);
> +
> +               cprm.base_task = vinit_task;
> +
>                 retval = -ENOMEM;
>                 sub_info = call_usermodehelper_setup(helper_argv[0],
>                                                 helper_argv, NULL, GFP_KERNEL,
> @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
>                         retval = call_usermodehelper_exec(sub_info,
>                                                           UMH_WAIT_EXEC);
>
> +               put_task_struct(vinit_task);
>                 argv_free(helper_argv);
>                 if (retval) {
>                         printk(KERN_INFO "Core dump to |%s pipe failed\n",
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 314b3ca..0c9a72c 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -59,6 +59,7 @@ struct linux_binprm {
>
>  /* Function parameter for binfmt->coredump */
>  struct coredump_params {
> +       struct task_struct *base_task;
>         const siginfo_t *siginfo;
>         struct pt_regs *regs;
>         struct file *file;
> --
> 1.8.5.1
>
>
>
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
@ 2016-08-05  6:32         ` Andrei Vagin
  0 siblings, 0 replies; 14+ messages in thread
From: Andrei Vagin @ 2016-08-05  6:32 UTC (permalink / raw)
  To: Zhao Lei; +Cc: LKML, Linux Containers, Eric W. Biederman

On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Currently when we set core_pattern to a pipe, the pipe program is
> forked by kthread running with root's permission, and write dumpfile
> into host's filesystem.
> Same thing happened for container, the dumper and dumpfile are also
> in host(not in container).
>
> It have following program:
> 1: Not consistent with file_type core_pattern
>    When we set core_pattern to a file, the container will write dump
>    into container's filesystem instead of host.
> 2: Not safe for privileged container
>    In a privileged container, user can destroy host system by following
>    command:
>    # # In a container
>    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
>    # make_dump
>
> This patch switch dumper program's environment to init task, so, for
> container, dumper program have same environment with init task in
> container, which make dumper program put in container's filesystem, and
> write coredump into container's filesystem.
> The dumper's permission is also limited into subset of container's init
> process.

Does it change the current behavior? A pid namespace may be used for
sandboxes. For example, chrome uses it. In this case, we probably want
to collect core dumps from a root pid namespace.

If we are going to virtualize core_pattern relative to the pid
namespace, maybe it's better to make it optional for a pid namespace.
I mean it core_pattern is not set for the current pid namespace, we
can check a parent pid namespace and so on. A helper will be executed
in a pid namespace, where core_pattern is set.

After reading these patches, I think it may be a good idea to add one
more mode to handle core files, when a core file is sent to an
existing process instead of executing a usermode helpers. This mode
will cover of use-cases where the pipe mode is used now. And it looks
more secure, because in this case we control namespaces and credential
for the abort daemon (a process which handles core files).u

How it can be done. An abort daemon creates an abstract listening unix
socket and writes its name into /proc/sys/kernel/core_pattern. The
kernel saves this name and the network namespace. Then when any
process is crashed, the kernel creates a new unix socket and connect
it to the socket of the abort daemon and streams a core file into this
socket.

>
> Suggested-by: Eric W. Biederman <ebieyderm@xmission.com>
> Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> ---
>  fs/coredump.c           | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/binfmts.h |  1 +
>  2 files changed, 87 insertions(+), 1 deletion(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 281b768..8511267 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>  {
>         struct file *files[2];
>         struct coredump_params *cp = (struct coredump_params *)info->data;
> +       struct task_struct *base_task;
> +
>         int err = create_pipe_files(files, 0);
>         if (err)
>                 return err;
> @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>
>         err = replace_fd(0, files[0], 0);
>         fput(files[0]);
> +       if (err)
> +               return err;
> +
>         /* and disallow core files too */
>         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
>
> -       return err;
> +       base_task = cp->base_task;
> +       if (base_task) {
> +               const struct cred *base_cred;
> +
> +               /* Set fs_root to base_task */
> +               spin_lock(&base_task->fs->lock);
> +               set_fs_root(current->fs, &base_task->fs->root);
> +               spin_unlock(&base_task->fs->lock);
> +
> +               /* Set namespaces to base_task */
> +               get_nsproxy(base_task->nsproxy);
> +               switch_task_namespaces(current, base_task->nsproxy);

This task will continue running in the current pid namespace, because
switch_task_namespaces doesn't change a pid for the task. Ussualy, we
need to make setns+fork to switch a pid namespace.

> +
> +               /* Set cgroup to base_task */
> +               current->flags &= ~PF_NO_SETAFFINITY;
> +               err = cgroup_attach_task_all(base_task, current);
> +               if (err < 0)
> +                       return err;
> +
> +               /* Set cred to base_task */
> +               base_cred = get_task_cred(base_task);
> +

I think you can use prepare_kernel_cred here

> +               new->uid   = base_cred->uid;
> +               new->gid   = base_cred->gid;
> +               new->suid  = base_cred->suid;
> +               new->sgid  = base_cred->sgid;
> +               new->euid  = base_cred->euid;
> +               new->egid  = base_cred->egid;
> +               new->fsuid = base_cred->fsuid;
> +               new->fsgid = base_cred->fsgid;
> +
> +               new->securebits = base_cred->securebits;
> +
> +               new->cap_inheritable = base_cred->cap_inheritable;
> +               new->cap_permitted   = base_cred->cap_permitted;
> +               new->cap_effective   = base_cred->cap_effective;
> +               new->cap_bset        = base_cred->cap_bset;
> +               new->cap_ambient     = base_cred->cap_ambient;
> +
> +               security_cred_free(new);
> +#ifdef CONFIG_SECURITY
> +               new->security = NULL;
> +#endif
> +               err = security_prepare_creds(new, base_cred, GFP_KERNEL);
> +               if (err < 0) {
> +                       put_cred(base_cred);
> +                       return err;
> +               }
> +
> +               free_uid(new->user);
> +               new->user = base_cred->user;
> +               get_uid(new->user);
> +
> +               put_user_ns(new->user_ns);
> +               new->user_ns = base_cred->user_ns;
> +               get_user_ns(new->user_ns);
> +
> +               put_group_info(new->group_info);
> +               new->group_info = base_cred->group_info;
> +               get_group_info(new->group_info);
> +
> +               put_cred(base_cred);
> +
> +               validate_creds(new);
> +       }
> +
> +       return 0;
>  }
>
>  void do_coredump(const siginfo_t *siginfo)
> @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
>
>         if (ispipe) {
>                 int dump_count;
> +               struct task_struct *vinit_task;
>                 char **helper_argv;
>                 struct subprocess_info *sub_info;
>
> @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
>                         goto fail_dropcount;
>                 }
>
> +               rcu_read_lock();
> +               vinit_task = find_task_by_vpid(1);
> +               rcu_read_unlock();
> +               if (!vinit_task) {
> +                       printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
> +                       goto fail_dropcount;
> +               }
> +
>                 helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
>                 if (!helper_argv) {
>                         printk(KERN_WARNING "%s failed to allocate memory\n",
> @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
>                         goto fail_dropcount;
>                 }
>
> +               get_task_struct(vinit_task);
> +
> +               cprm.base_task = vinit_task;
> +
>                 retval = -ENOMEM;
>                 sub_info = call_usermodehelper_setup(helper_argv[0],
>                                                 helper_argv, NULL, GFP_KERNEL,
> @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
>                         retval = call_usermodehelper_exec(sub_info,
>                                                           UMH_WAIT_EXEC);
>
> +               put_task_struct(vinit_task);
>                 argv_free(helper_argv);
>                 if (retval) {
>                         printk(KERN_INFO "Core dump to |%s pipe failed\n",
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 314b3ca..0c9a72c 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -59,6 +59,7 @@ struct linux_binprm {
>
>  /* Function parameter for binfmt->coredump */
>  struct coredump_params {
> +       struct task_struct *base_task;
>         const siginfo_t *siginfo;
>         struct pt_regs *regs;
>         struct file *file;
> --
> 1.8.5.1
>
>
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
  2016-08-05  6:32         ` Andrei Vagin
@ 2016-08-05  7:52             ` Zhao Lei
  -1 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-05  7:52 UTC (permalink / raw)
  To: 'Andrei Vagin'
  Cc: 'Linux Containers', 'LKML', 'Eric W. Biederman'

Hi, Andrei Vagin

Thanks for your detailed review and suggestion.

> -----Original Message-----
> From: Andrei Vagin [mailto:avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> Sent: Friday, August 05, 2016 2:32 PM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; Linux Containers
> <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>; Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> container
> 
> On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> > Currently when we set core_pattern to a pipe, the pipe program is
> > forked by kthread running with root's permission, and write dumpfile
> > into host's filesystem.
> > Same thing happened for container, the dumper and dumpfile are also
> > in host(not in container).
> >
> > It have following program:
> > 1: Not consistent with file_type core_pattern
> >    When we set core_pattern to a file, the container will write dump
> >    into container's filesystem instead of host.
> > 2: Not safe for privileged container
> >    In a privileged container, user can destroy host system by following
> >    command:
> >    # # In a container
> >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> >    # make_dump
> >
> > This patch switch dumper program's environment to init task, so, for
> > container, dumper program have same environment with init task in
> > container, which make dumper program put in container's filesystem, and
> > write coredump into container's filesystem.
> > The dumper's permission is also limited into subset of container's init
> > process.
> 
> Does it change the current behavior? A pid namespace may be used for
> sandboxes. For example, chrome uses it. In this case, we probably want
> to collect core dumps from a root pid namespace.
> 
> If we are going to virtualize core_pattern relative to the pid
> namespace, maybe it's better to make it optional for a pid namespace.
> I mean it core_pattern is not set for the current pid namespace, we
> can check a parent pid namespace and so on. A helper will be executed
> in a pid namespace, where core_pattern is set.
> 
Is it means, we don't want to always gather dump from the container ifself,
wnd we want to custom which level of the container can gather dump
information?

If it is exact, current patch have similar function as above.

If user had not set core_pattern in a namespace, the dump setting in
the parent namespace will be used.
For example,
HOST        change core_pattern
Container1  no change
Container2  change core_pattern
Container3  no change
Container4  no change

If code dump happened in container 2, 3 and 4, the dumper's environment
is same as container2.
And if dump happened in container 1 and HOST, the dumper's environment
will be same as HOST.

So,
1: if a container want to make the custom core dump,
  it can set core_pattern in container, and the gathered information
  will be limited in this container.
2: If the host(or some top level container) want to gather dump information
  more than a container internal, the host(or some top level container)
  can set core_pattern, and not change core_pattern in the container,
  then the pipe program will be run in same environment as the host,
  and gather information from host's env.

> After reading these patches, I think it may be a good idea to add one
> more mode to handle core files, when a core file is sent to an
> existing process instead of executing a usermode helpers. This mode
> will cover of use-cases where the pipe mode is used now. And it looks
> more secure, because in this case we control namespaces and credential
> for the abort daemon (a process which handles core files).u
> 
> How it can be done. An abort daemon creates an abstract listening unix
> socket and writes its name into /proc/sys/kernel/core_pattern. The
> kernel saves this name and the network namespace. Then when any
> process is crashed, the kernel creates a new unix socket and connect
> it to the socket of the abort daemon and streams a core file into this
> socket.
> 
Good idea.

And do it means if we want to custom core_pattern for each container,
we can create more than one abort daemon for it?
If the abort daemon is running in the host, the dump file will be write
into host's env, and if abort daemon is running in container, the dump
file will be write into container's fs.

> >
> > Suggested-by: Eric W. Biederman <ebieyderm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> > Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> >
> > Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > ---
> >  fs/coredump.c           | 87
> ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/binfmts.h |  1 +
> >  2 files changed, 87 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 281b768..8511267 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info
> *info, struct cred *new)
> >  {
> >         struct file *files[2];
> >         struct coredump_params *cp = (struct coredump_params
> *)info->data;
> > +       struct task_struct *base_task;
> > +
> >         int err = create_pipe_files(files, 0);
> >         if (err)
> >                 return err;
> > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info
> *info, struct cred *new)
> >
> >         err = replace_fd(0, files[0], 0);
> >         fput(files[0]);
> > +       if (err)
> > +               return err;
> > +
> >         /* and disallow core files too */
> >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> >
> > -       return err;
> > +       base_task = cp->base_task;
> > +       if (base_task) {
> > +               const struct cred *base_cred;
> > +
> > +               /* Set fs_root to base_task */
> > +               spin_lock(&base_task->fs->lock);
> > +               set_fs_root(current->fs, &base_task->fs->root);
> > +               spin_unlock(&base_task->fs->lock);
> > +
> > +               /* Set namespaces to base_task */
> > +               get_nsproxy(base_task->nsproxy);
> > +               switch_task_namespaces(current, base_task->nsproxy);
> 
> This task will continue running in the current pid namespace, because
> switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> need to make setns+fork to switch a pid namespace.
> 
Yes.
Now I known why I had not find this problem in my test,
I output pids by a shell command, and the shell command is already got forked.

Seems need double fork(make code complex...) to do it.

> > +
> > +               /* Set cgroup to base_task */
> > +               current->flags &= ~PF_NO_SETAFFINITY;
> > +               err = cgroup_attach_task_all(base_task, current);
> > +               if (err < 0)
> > +                       return err;
> > +
> > +               /* Set cred to base_task */
> > +               base_cred = get_task_cred(base_task);
> > +
> 
> I think you can use prepare_kernel_cred here
> 
prepare_kernel_cred() is called by caller of this function(call_usermodehelper_exec_async()),
and this function is a hook, which can set additional cred property like:
call_usermodehelper_exec_async()
{
    prepare_kernel_cred()
    this_function()
    {
        custom_cred struct
    } 
    commit_creds(new);
}

So we can only change cred's container here.

But seems we need rewrite this code when we move to above "double fork".

Before rewrite, could you give me some suggestion, if current code
already can custom to dump information of "container internal" and "parent level container",
is this meet your request?

Thanks
Zhaolei

> > +               new->uid   = base_cred->uid;
> > +               new->gid   = base_cred->gid;
> > +               new->suid  = base_cred->suid;
> > +               new->sgid  = base_cred->sgid;
> > +               new->euid  = base_cred->euid;
> > +               new->egid  = base_cred->egid;
> > +               new->fsuid = base_cred->fsuid;
> > +               new->fsgid = base_cred->fsgid;
> > +
> > +               new->securebits = base_cred->securebits;
> > +
> > +               new->cap_inheritable = base_cred->cap_inheritable;
> > +               new->cap_permitted   = base_cred->cap_permitted;
> > +               new->cap_effective   = base_cred->cap_effective;
> > +               new->cap_bset        = base_cred->cap_bset;
> > +               new->cap_ambient     = base_cred->cap_ambient;
> > +
> > +               security_cred_free(new);
> > +#ifdef CONFIG_SECURITY
> > +               new->security = NULL;
> > +#endif
> > +               err = security_prepare_creds(new, base_cred,
> GFP_KERNEL);
> > +               if (err < 0) {
> > +                       put_cred(base_cred);
> > +                       return err;
> > +               }
> > +
> > +               free_uid(new->user);
> > +               new->user = base_cred->user;
> > +               get_uid(new->user);
> > +
> > +               put_user_ns(new->user_ns);
> > +               new->user_ns = base_cred->user_ns;
> > +               get_user_ns(new->user_ns);
> > +
> > +               put_group_info(new->group_info);
> > +               new->group_info = base_cred->group_info;
> > +               get_group_info(new->group_info);
> > +
> > +               put_cred(base_cred);
> > +
> > +               validate_creds(new);
> > +       }
> > +
> > +       return 0;
> >  }
> >
> >  void do_coredump(const siginfo_t *siginfo)
> > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> >
> >         if (ispipe) {
> >                 int dump_count;
> > +               struct task_struct *vinit_task;
> >                 char **helper_argv;
> >                 struct subprocess_info *sub_info;
> >
> > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> >                         goto fail_dropcount;
> >                 }
> >
> > +               rcu_read_lock();
> > +               vinit_task = find_task_by_vpid(1);
> > +               rcu_read_unlock();
> > +               if (!vinit_task) {
> > +                       printk(KERN_WARNING "failed getting init task
> info, skipping core dump\n");
> > +                       goto fail_dropcount;
> > +               }
> > +
> >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> NULL);
> >                 if (!helper_argv) {
> >                         printk(KERN_WARNING "%s failed to allocate
> memory\n",
> > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> >                         goto fail_dropcount;
> >                 }
> >
> > +               get_task_struct(vinit_task);
> > +
> > +               cprm.base_task = vinit_task;
> > +
> >                 retval = -ENOMEM;
> >                 sub_info = call_usermodehelper_setup(helper_argv[0],
> >                                                 helper_argv,
> NULL, GFP_KERNEL,
> > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> >                         retval = call_usermodehelper_exec(sub_info,
> >
> UMH_WAIT_EXEC);
> >
> > +               put_task_struct(vinit_task);
> >                 argv_free(helper_argv);
> >                 if (retval) {
> >                         printk(KERN_INFO "Core dump to |%s pipe
> failed\n",
> > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > index 314b3ca..0c9a72c 100644
> > --- a/include/linux/binfmts.h
> > +++ b/include/linux/binfmts.h
> > @@ -59,6 +59,7 @@ struct linux_binprm {
> >
> >  /* Function parameter for binfmt->coredump */
> >  struct coredump_params {
> > +       struct task_struct *base_task;
> >         const siginfo_t *siginfo;
> >         struct pt_regs *regs;
> >         struct file *file;
> > --
> > 1.8.5.1
> >
> >
> >
> > _______________________________________________
> > Containers mailing list
> > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
@ 2016-08-05  7:52             ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-05  7:52 UTC (permalink / raw)
  To: 'Andrei Vagin'
  Cc: 'LKML', 'Linux Containers', 'Eric W. Biederman'

Hi, Andrei Vagin

Thanks for your detailed review and suggestion.

> -----Original Message-----
> From: Andrei Vagin [mailto:avagin@gmail.com]
> Sent: Friday, August 05, 2016 2:32 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: LKML <linux-kernel@vger.kernel.org>; Linux Containers
> <containers@lists.linux-foundation.org>; Eric W. Biederman
> <ebiederm@xmission.com>
> Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> container
> 
> On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > Currently when we set core_pattern to a pipe, the pipe program is
> > forked by kthread running with root's permission, and write dumpfile
> > into host's filesystem.
> > Same thing happened for container, the dumper and dumpfile are also
> > in host(not in container).
> >
> > It have following program:
> > 1: Not consistent with file_type core_pattern
> >    When we set core_pattern to a file, the container will write dump
> >    into container's filesystem instead of host.
> > 2: Not safe for privileged container
> >    In a privileged container, user can destroy host system by following
> >    command:
> >    # # In a container
> >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> >    # make_dump
> >
> > This patch switch dumper program's environment to init task, so, for
> > container, dumper program have same environment with init task in
> > container, which make dumper program put in container's filesystem, and
> > write coredump into container's filesystem.
> > The dumper's permission is also limited into subset of container's init
> > process.
> 
> Does it change the current behavior? A pid namespace may be used for
> sandboxes. For example, chrome uses it. In this case, we probably want
> to collect core dumps from a root pid namespace.
> 
> If we are going to virtualize core_pattern relative to the pid
> namespace, maybe it's better to make it optional for a pid namespace.
> I mean it core_pattern is not set for the current pid namespace, we
> can check a parent pid namespace and so on. A helper will be executed
> in a pid namespace, where core_pattern is set.
> 
Is it means, we don't want to always gather dump from the container ifself,
wnd we want to custom which level of the container can gather dump
information?

If it is exact, current patch have similar function as above.

If user had not set core_pattern in a namespace, the dump setting in
the parent namespace will be used.
For example,
HOST        change core_pattern
Container1  no change
Container2  change core_pattern
Container3  no change
Container4  no change

If code dump happened in container 2, 3 and 4, the dumper's environment
is same as container2.
And if dump happened in container 1 and HOST, the dumper's environment
will be same as HOST.

So,
1: if a container want to make the custom core dump,
  it can set core_pattern in container, and the gathered information
  will be limited in this container.
2: If the host(or some top level container) want to gather dump information
  more than a container internal, the host(or some top level container)
  can set core_pattern, and not change core_pattern in the container,
  then the pipe program will be run in same environment as the host,
  and gather information from host's env.

> After reading these patches, I think it may be a good idea to add one
> more mode to handle core files, when a core file is sent to an
> existing process instead of executing a usermode helpers. This mode
> will cover of use-cases where the pipe mode is used now. And it looks
> more secure, because in this case we control namespaces and credential
> for the abort daemon (a process which handles core files).u
> 
> How it can be done. An abort daemon creates an abstract listening unix
> socket and writes its name into /proc/sys/kernel/core_pattern. The
> kernel saves this name and the network namespace. Then when any
> process is crashed, the kernel creates a new unix socket and connect
> it to the socket of the abort daemon and streams a core file into this
> socket.
> 
Good idea.

And do it means if we want to custom core_pattern for each container,
we can create more than one abort daemon for it?
If the abort daemon is running in the host, the dump file will be write
into host's env, and if abort daemon is running in container, the dump
file will be write into container's fs.

> >
> > Suggested-by: Eric W. Biederman <ebieyderm@xmission.com>
> > Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >
> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > ---
> >  fs/coredump.c           | 87
> ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/binfmts.h |  1 +
> >  2 files changed, 87 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 281b768..8511267 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info
> *info, struct cred *new)
> >  {
> >         struct file *files[2];
> >         struct coredump_params *cp = (struct coredump_params
> *)info->data;
> > +       struct task_struct *base_task;
> > +
> >         int err = create_pipe_files(files, 0);
> >         if (err)
> >                 return err;
> > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info
> *info, struct cred *new)
> >
> >         err = replace_fd(0, files[0], 0);
> >         fput(files[0]);
> > +       if (err)
> > +               return err;
> > +
> >         /* and disallow core files too */
> >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> >
> > -       return err;
> > +       base_task = cp->base_task;
> > +       if (base_task) {
> > +               const struct cred *base_cred;
> > +
> > +               /* Set fs_root to base_task */
> > +               spin_lock(&base_task->fs->lock);
> > +               set_fs_root(current->fs, &base_task->fs->root);
> > +               spin_unlock(&base_task->fs->lock);
> > +
> > +               /* Set namespaces to base_task */
> > +               get_nsproxy(base_task->nsproxy);
> > +               switch_task_namespaces(current, base_task->nsproxy);
> 
> This task will continue running in the current pid namespace, because
> switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> need to make setns+fork to switch a pid namespace.
> 
Yes.
Now I known why I had not find this problem in my test,
I output pids by a shell command, and the shell command is already got forked.

Seems need double fork(make code complex...) to do it.

> > +
> > +               /* Set cgroup to base_task */
> > +               current->flags &= ~PF_NO_SETAFFINITY;
> > +               err = cgroup_attach_task_all(base_task, current);
> > +               if (err < 0)
> > +                       return err;
> > +
> > +               /* Set cred to base_task */
> > +               base_cred = get_task_cred(base_task);
> > +
> 
> I think you can use prepare_kernel_cred here
> 
prepare_kernel_cred() is called by caller of this function(call_usermodehelper_exec_async()),
and this function is a hook, which can set additional cred property like:
call_usermodehelper_exec_async()
{
    prepare_kernel_cred()
    this_function()
    {
        custom_cred struct
    } 
    commit_creds(new);
}

So we can only change cred's container here.

But seems we need rewrite this code when we move to above "double fork".

Before rewrite, could you give me some suggestion, if current code
already can custom to dump information of "container internal" and "parent level container",
is this meet your request?

Thanks
Zhaolei

> > +               new->uid   = base_cred->uid;
> > +               new->gid   = base_cred->gid;
> > +               new->suid  = base_cred->suid;
> > +               new->sgid  = base_cred->sgid;
> > +               new->euid  = base_cred->euid;
> > +               new->egid  = base_cred->egid;
> > +               new->fsuid = base_cred->fsuid;
> > +               new->fsgid = base_cred->fsgid;
> > +
> > +               new->securebits = base_cred->securebits;
> > +
> > +               new->cap_inheritable = base_cred->cap_inheritable;
> > +               new->cap_permitted   = base_cred->cap_permitted;
> > +               new->cap_effective   = base_cred->cap_effective;
> > +               new->cap_bset        = base_cred->cap_bset;
> > +               new->cap_ambient     = base_cred->cap_ambient;
> > +
> > +               security_cred_free(new);
> > +#ifdef CONFIG_SECURITY
> > +               new->security = NULL;
> > +#endif
> > +               err = security_prepare_creds(new, base_cred,
> GFP_KERNEL);
> > +               if (err < 0) {
> > +                       put_cred(base_cred);
> > +                       return err;
> > +               }
> > +
> > +               free_uid(new->user);
> > +               new->user = base_cred->user;
> > +               get_uid(new->user);
> > +
> > +               put_user_ns(new->user_ns);
> > +               new->user_ns = base_cred->user_ns;
> > +               get_user_ns(new->user_ns);
> > +
> > +               put_group_info(new->group_info);
> > +               new->group_info = base_cred->group_info;
> > +               get_group_info(new->group_info);
> > +
> > +               put_cred(base_cred);
> > +
> > +               validate_creds(new);
> > +       }
> > +
> > +       return 0;
> >  }
> >
> >  void do_coredump(const siginfo_t *siginfo)
> > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> >
> >         if (ispipe) {
> >                 int dump_count;
> > +               struct task_struct *vinit_task;
> >                 char **helper_argv;
> >                 struct subprocess_info *sub_info;
> >
> > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> >                         goto fail_dropcount;
> >                 }
> >
> > +               rcu_read_lock();
> > +               vinit_task = find_task_by_vpid(1);
> > +               rcu_read_unlock();
> > +               if (!vinit_task) {
> > +                       printk(KERN_WARNING "failed getting init task
> info, skipping core dump\n");
> > +                       goto fail_dropcount;
> > +               }
> > +
> >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> NULL);
> >                 if (!helper_argv) {
> >                         printk(KERN_WARNING "%s failed to allocate
> memory\n",
> > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> >                         goto fail_dropcount;
> >                 }
> >
> > +               get_task_struct(vinit_task);
> > +
> > +               cprm.base_task = vinit_task;
> > +
> >                 retval = -ENOMEM;
> >                 sub_info = call_usermodehelper_setup(helper_argv[0],
> >                                                 helper_argv,
> NULL, GFP_KERNEL,
> > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> >                         retval = call_usermodehelper_exec(sub_info,
> >
> UMH_WAIT_EXEC);
> >
> > +               put_task_struct(vinit_task);
> >                 argv_free(helper_argv);
> >                 if (retval) {
> >                         printk(KERN_INFO "Core dump to |%s pipe
> failed\n",
> > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > index 314b3ca..0c9a72c 100644
> > --- a/include/linux/binfmts.h
> > +++ b/include/linux/binfmts.h
> > @@ -59,6 +59,7 @@ struct linux_binprm {
> >
> >  /* Function parameter for binfmt->coredump */
> >  struct coredump_params {
> > +       struct task_struct *base_task;
> >         const siginfo_t *siginfo;
> >         struct pt_regs *regs;
> >         struct file *file;
> > --
> > 1.8.5.1
> >
> >
> >
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
  2016-08-05  7:52             ` Zhao Lei
@ 2016-08-05 17:13               ` 'Andrei Vagin'
  -1 siblings, 0 replies; 14+ messages in thread
From: 'Andrei Vagin' @ 2016-08-05 17:13 UTC (permalink / raw)
  To: Zhao Lei
  Cc: 'Linux Containers', 'LKML', 'Eric W. Biederman'

On Fri, Aug 05, 2016 at 03:52:25PM +0800, Zhao Lei wrote:
> Hi, Andrei Vagin
> 
> Thanks for your detailed review and suggestion.
> 
> > -----Original Message-----
> > From: Andrei Vagin [mailto:avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> > Sent: Friday, August 05, 2016 2:32 PM
> > To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > Cc: LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; Linux Containers
> > <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>; Eric W. Biederman
> > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> > container
> > 
> > On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> > > Currently when we set core_pattern to a pipe, the pipe program is
> > > forked by kthread running with root's permission, and write dumpfile
> > > into host's filesystem.
> > > Same thing happened for container, the dumper and dumpfile are also
> > > in host(not in container).
> > >
> > > It have following program:
> > > 1: Not consistent with file_type core_pattern
> > >    When we set core_pattern to a file, the container will write dump
> > >    into container's filesystem instead of host.
> > > 2: Not safe for privileged container
> > >    In a privileged container, user can destroy host system by following
> > >    command:
> > >    # # In a container
> > >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> > >    # make_dump
> > >
> > > This patch switch dumper program's environment to init task, so, for
> > > container, dumper program have same environment with init task in
> > > container, which make dumper program put in container's filesystem, and
> > > write coredump into container's filesystem.
> > > The dumper's permission is also limited into subset of container's init
> > > process.
> > 
> > Does it change the current behavior? A pid namespace may be used for
> > sandboxes. For example, chrome uses it. In this case, we probably want
> > to collect core dumps from a root pid namespace.
> > 
> > If we are going to virtualize core_pattern relative to the pid
> > namespace, maybe it's better to make it optional for a pid namespace.
> > I mean it core_pattern is not set for the current pid namespace, we
> > can check a parent pid namespace and so on. A helper will be executed
> > in a pid namespace, where core_pattern is set.
> > 
> Is it means, we don't want to always gather dump from the container ifself,
> wnd we want to custom which level of the container can gather dump
> information?
> 
> If it is exact, current patch have similar function as above.
> 
> If user had not set core_pattern in a namespace, the dump setting in
> the parent namespace will be used.
> For example,
> HOST        change core_pattern
> Container1  no change
> Container2  change core_pattern
> Container3  no change
> Container4  no change
> 
> If code dump happened in container 2, 3 and 4, the dumper's environment
> is same as container2.
> And if dump happened in container 1 and HOST, the dumper's environment
> will be same as HOST.
> 
> So,
> 1: if a container want to make the custom core dump,
>   it can set core_pattern in container, and the gathered information
>   will be limited in this container.
> 2: If the host(or some top level container) want to gather dump information
>   more than a container internal, the host(or some top level container)
>   can set core_pattern, and not change core_pattern in the container,
>   then the pipe program will be run in same environment as the host,
>   and gather information from host's env.

The behavior which you described is the same with what I'm thinking
about.

But I don't understand how you implement it in a code.
I see that you call find_task_by_vpid(1) to get a base task, so the base
that is always the init task of the current pid namespace. I think we
need to get the init task from a pid namespace where core_pattern is
set, don't we?

pidns1 task1 (core_pattern is set)
\_pidns2 task2 (core_pattern isn't set)
 \_pidns2 ...
  \_pidns2 taskX (crashed and call do_coredump())

find_task_by_vpid(1) is called from do_coredump() and will return task2,
but if I understand you correctly, the base_task has to be task1 in this
case.

> 
> > After reading these patches, I think it may be a good idea to add one
> > more mode to handle core files, when a core file is sent to an
> > existing process instead of executing a usermode helpers. This mode
> > will cover of use-cases where the pipe mode is used now. And it looks
> > more secure, because in this case we control namespaces and credential
> > for the abort daemon (a process which handles core files).u
> > 
> > How it can be done. An abort daemon creates an abstract listening unix
> > socket and writes its name into /proc/sys/kernel/core_pattern. The
> > kernel saves this name and the network namespace. Then when any
> > process is crashed, the kernel creates a new unix socket and connect
> > it to the socket of the abort daemon and streams a core file into this
> > socket.
> > 
> Good idea.
> 
> And do it means if we want to custom core_pattern for each container,
> we can create more than one abort daemon for it?

Yes.

> If the abort daemon is running in the host, the dump file will be write
> into host's env, and if abort daemon is running in container, the dump
> file will be write into container's fs.

Yes. An abort daemon handles core files in its namespaces. The abort
daemon knows a pid of a crashed process, so it has access to namespaces of
the crashed task via /proc/pid/ns/* and can work with them. I want to say
that this daemon can decided in which context to handle a core file.

I don't like the idea to play with namespaces, credentials, an security
context for user helpers.

> 
> > >
> > > Suggested-by: Eric W. Biederman <ebieyderm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> > > Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> > >
> > > Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > > ---
> > >  fs/coredump.c           | 87
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  include/linux/binfmts.h |  1 +
> > >  2 files changed, 87 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > index 281b768..8511267 100644
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info
> > *info, struct cred *new)
> > >  {
> > >         struct file *files[2];
> > >         struct coredump_params *cp = (struct coredump_params
> > *)info->data;
> > > +       struct task_struct *base_task;
> > > +
> > >         int err = create_pipe_files(files, 0);
> > >         if (err)
> > >                 return err;
> > > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info
> > *info, struct cred *new)
> > >
> > >         err = replace_fd(0, files[0], 0);
> > >         fput(files[0]);
> > > +       if (err)
> > > +               return err;
> > > +
> > >         /* and disallow core files too */
> > >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> > >
> > > -       return err;
> > > +       base_task = cp->base_task;
> > > +       if (base_task) {
> > > +               const struct cred *base_cred;
> > > +
> > > +               /* Set fs_root to base_task */
> > > +               spin_lock(&base_task->fs->lock);
> > > +               set_fs_root(current->fs, &base_task->fs->root);
> > > +               spin_unlock(&base_task->fs->lock);
> > > +
> > > +               /* Set namespaces to base_task */
> > > +               get_nsproxy(base_task->nsproxy);
> > > +               switch_task_namespaces(current, base_task->nsproxy);
> > 
> > This task will continue running in the current pid namespace, because
> > switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> > need to make setns+fork to switch a pid namespace.
> > 
> Yes.
> Now I known why I had not find this problem in my test,
> I output pids by a shell command, and the shell command is already got forked.
> 
> Seems need double fork(make code complex...) to do it.
> 
> > > +
> > > +               /* Set cgroup to base_task */
> > > +               current->flags &= ~PF_NO_SETAFFINITY;
> > > +               err = cgroup_attach_task_all(base_task, current);
> > > +               if (err < 0)
> > > +                       return err;
> > > +
> > > +               /* Set cred to base_task */
> > > +               base_cred = get_task_cred(base_task);
> > > +
> > 
> > I think you can use prepare_kernel_cred here
> > 
> prepare_kernel_cred() is called by caller of this function(call_usermodehelper_exec_async()),
> and this function is a hook, which can set additional cred property like:
> call_usermodehelper_exec_async()
> {
>     prepare_kernel_cred()
>     this_function()
>     {
>         custom_cred struct
>     } 
>     commit_creds(new);
> }
> 
> So we can only change cred's container here.
> 
> But seems we need rewrite this code when we move to above "double fork".
> 
> Before rewrite, could you give me some suggestion, if current code
> already can custom to dump information of "container internal" and "parent level container",
> is this meet your request?
> 
> Thanks
> Zhaolei
> 
> > > +               new->uid   = base_cred->uid;
> > > +               new->gid   = base_cred->gid;
> > > +               new->suid  = base_cred->suid;
> > > +               new->sgid  = base_cred->sgid;
> > > +               new->euid  = base_cred->euid;
> > > +               new->egid  = base_cred->egid;
> > > +               new->fsuid = base_cred->fsuid;
> > > +               new->fsgid = base_cred->fsgid;
> > > +
> > > +               new->securebits = base_cred->securebits;
> > > +
> > > +               new->cap_inheritable = base_cred->cap_inheritable;
> > > +               new->cap_permitted   = base_cred->cap_permitted;
> > > +               new->cap_effective   = base_cred->cap_effective;
> > > +               new->cap_bset        = base_cred->cap_bset;
> > > +               new->cap_ambient     = base_cred->cap_ambient;
> > > +
> > > +               security_cred_free(new);
> > > +#ifdef CONFIG_SECURITY
> > > +               new->security = NULL;
> > > +#endif
> > > +               err = security_prepare_creds(new, base_cred,
> > GFP_KERNEL);
> > > +               if (err < 0) {
> > > +                       put_cred(base_cred);
> > > +                       return err;
> > > +               }
> > > +
> > > +               free_uid(new->user);
> > > +               new->user = base_cred->user;
> > > +               get_uid(new->user);
> > > +
> > > +               put_user_ns(new->user_ns);
> > > +               new->user_ns = base_cred->user_ns;
> > > +               get_user_ns(new->user_ns);
> > > +
> > > +               put_group_info(new->group_info);
> > > +               new->group_info = base_cred->group_info;
> > > +               get_group_info(new->group_info);
> > > +
> > > +               put_cred(base_cred);
> > > +
> > > +               validate_creds(new);
> > > +       }
> > > +
> > > +       return 0;
> > >  }
> > >
> > >  void do_coredump(const siginfo_t *siginfo)
> > > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> > >
> > >         if (ispipe) {
> > >                 int dump_count;
> > > +               struct task_struct *vinit_task;
> > >                 char **helper_argv;
> > >                 struct subprocess_info *sub_info;
> > >
> > > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         goto fail_dropcount;
> > >                 }
> > >
> > > +               rcu_read_lock();
> > > +               vinit_task = find_task_by_vpid(1);
> > > +               rcu_read_unlock();
> > > +               if (!vinit_task) {
> > > +                       printk(KERN_WARNING "failed getting init task
> > info, skipping core dump\n");
> > > +                       goto fail_dropcount;
> > > +               }
> > > +
> > >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> > NULL);
> > >                 if (!helper_argv) {
> > >                         printk(KERN_WARNING "%s failed to allocate
> > memory\n",
> > > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         goto fail_dropcount;
> > >                 }
> > >
> > > +               get_task_struct(vinit_task);
> > > +
> > > +               cprm.base_task = vinit_task;
> > > +
> > >                 retval = -ENOMEM;
> > >                 sub_info = call_usermodehelper_setup(helper_argv[0],
> > >                                                 helper_argv,
> > NULL, GFP_KERNEL,
> > > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         retval = call_usermodehelper_exec(sub_info,
> > >
> > UMH_WAIT_EXEC);
> > >
> > > +               put_task_struct(vinit_task);
> > >                 argv_free(helper_argv);
> > >                 if (retval) {
> > >                         printk(KERN_INFO "Core dump to |%s pipe
> > failed\n",
> > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > > index 314b3ca..0c9a72c 100644
> > > --- a/include/linux/binfmts.h
> > > +++ b/include/linux/binfmts.h
> > > @@ -59,6 +59,7 @@ struct linux_binprm {
> > >
> > >  /* Function parameter for binfmt->coredump */
> > >  struct coredump_params {
> > > +       struct task_struct *base_task;
> > >         const siginfo_t *siginfo;
> > >         struct pt_regs *regs;
> > >         struct file *file;
> > > --
> > > 1.8.5.1
> > >
> > >
> > >
> > > _______________________________________________
> > > Containers mailing list
> > > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
@ 2016-08-05 17:13               ` 'Andrei Vagin'
  0 siblings, 0 replies; 14+ messages in thread
From: 'Andrei Vagin' @ 2016-08-05 17:13 UTC (permalink / raw)
  To: Zhao Lei
  Cc: 'LKML', 'Linux Containers', 'Eric W. Biederman'

On Fri, Aug 05, 2016 at 03:52:25PM +0800, Zhao Lei wrote:
> Hi, Andrei Vagin
> 
> Thanks for your detailed review and suggestion.
> 
> > -----Original Message-----
> > From: Andrei Vagin [mailto:avagin@gmail.com]
> > Sent: Friday, August 05, 2016 2:32 PM
> > To: Zhao Lei <zhaolei@cn.fujitsu.com>
> > Cc: LKML <linux-kernel@vger.kernel.org>; Linux Containers
> > <containers@lists.linux-foundation.org>; Eric W. Biederman
> > <ebiederm@xmission.com>
> > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> > container
> > 
> > On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > > Currently when we set core_pattern to a pipe, the pipe program is
> > > forked by kthread running with root's permission, and write dumpfile
> > > into host's filesystem.
> > > Same thing happened for container, the dumper and dumpfile are also
> > > in host(not in container).
> > >
> > > It have following program:
> > > 1: Not consistent with file_type core_pattern
> > >    When we set core_pattern to a file, the container will write dump
> > >    into container's filesystem instead of host.
> > > 2: Not safe for privileged container
> > >    In a privileged container, user can destroy host system by following
> > >    command:
> > >    # # In a container
> > >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> > >    # make_dump
> > >
> > > This patch switch dumper program's environment to init task, so, for
> > > container, dumper program have same environment with init task in
> > > container, which make dumper program put in container's filesystem, and
> > > write coredump into container's filesystem.
> > > The dumper's permission is also limited into subset of container's init
> > > process.
> > 
> > Does it change the current behavior? A pid namespace may be used for
> > sandboxes. For example, chrome uses it. In this case, we probably want
> > to collect core dumps from a root pid namespace.
> > 
> > If we are going to virtualize core_pattern relative to the pid
> > namespace, maybe it's better to make it optional for a pid namespace.
> > I mean it core_pattern is not set for the current pid namespace, we
> > can check a parent pid namespace and so on. A helper will be executed
> > in a pid namespace, where core_pattern is set.
> > 
> Is it means, we don't want to always gather dump from the container ifself,
> wnd we want to custom which level of the container can gather dump
> information?
> 
> If it is exact, current patch have similar function as above.
> 
> If user had not set core_pattern in a namespace, the dump setting in
> the parent namespace will be used.
> For example,
> HOST        change core_pattern
> Container1  no change
> Container2  change core_pattern
> Container3  no change
> Container4  no change
> 
> If code dump happened in container 2, 3 and 4, the dumper's environment
> is same as container2.
> And if dump happened in container 1 and HOST, the dumper's environment
> will be same as HOST.
> 
> So,
> 1: if a container want to make the custom core dump,
>   it can set core_pattern in container, and the gathered information
>   will be limited in this container.
> 2: If the host(or some top level container) want to gather dump information
>   more than a container internal, the host(or some top level container)
>   can set core_pattern, and not change core_pattern in the container,
>   then the pipe program will be run in same environment as the host,
>   and gather information from host's env.

The behavior which you described is the same with what I'm thinking
about.

But I don't understand how you implement it in a code.
I see that you call find_task_by_vpid(1) to get a base task, so the base
that is always the init task of the current pid namespace. I think we
need to get the init task from a pid namespace where core_pattern is
set, don't we?

pidns1 task1 (core_pattern is set)
\_pidns2 task2 (core_pattern isn't set)
 \_pidns2 ...
  \_pidns2 taskX (crashed and call do_coredump())

find_task_by_vpid(1) is called from do_coredump() and will return task2,
but if I understand you correctly, the base_task has to be task1 in this
case.

> 
> > After reading these patches, I think it may be a good idea to add one
> > more mode to handle core files, when a core file is sent to an
> > existing process instead of executing a usermode helpers. This mode
> > will cover of use-cases where the pipe mode is used now. And it looks
> > more secure, because in this case we control namespaces and credential
> > for the abort daemon (a process which handles core files).u
> > 
> > How it can be done. An abort daemon creates an abstract listening unix
> > socket and writes its name into /proc/sys/kernel/core_pattern. The
> > kernel saves this name and the network namespace. Then when any
> > process is crashed, the kernel creates a new unix socket and connect
> > it to the socket of the abort daemon and streams a core file into this
> > socket.
> > 
> Good idea.
> 
> And do it means if we want to custom core_pattern for each container,
> we can create more than one abort daemon for it?

Yes.

> If the abort daemon is running in the host, the dump file will be write
> into host's env, and if abort daemon is running in container, the dump
> file will be write into container's fs.

Yes. An abort daemon handles core files in its namespaces. The abort
daemon knows a pid of a crashed process, so it has access to namespaces of
the crashed task via /proc/pid/ns/* and can work with them. I want to say
that this daemon can decided in which context to handle a core file.

I don't like the idea to play with namespaces, credentials, an security
context for user helpers.

> 
> > >
> > > Suggested-by: Eric W. Biederman <ebieyderm@xmission.com>
> > > Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > ---
> > >  fs/coredump.c           | 87
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  include/linux/binfmts.h |  1 +
> > >  2 files changed, 87 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > index 281b768..8511267 100644
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info
> > *info, struct cred *new)
> > >  {
> > >         struct file *files[2];
> > >         struct coredump_params *cp = (struct coredump_params
> > *)info->data;
> > > +       struct task_struct *base_task;
> > > +
> > >         int err = create_pipe_files(files, 0);
> > >         if (err)
> > >                 return err;
> > > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info
> > *info, struct cred *new)
> > >
> > >         err = replace_fd(0, files[0], 0);
> > >         fput(files[0]);
> > > +       if (err)
> > > +               return err;
> > > +
> > >         /* and disallow core files too */
> > >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> > >
> > > -       return err;
> > > +       base_task = cp->base_task;
> > > +       if (base_task) {
> > > +               const struct cred *base_cred;
> > > +
> > > +               /* Set fs_root to base_task */
> > > +               spin_lock(&base_task->fs->lock);
> > > +               set_fs_root(current->fs, &base_task->fs->root);
> > > +               spin_unlock(&base_task->fs->lock);
> > > +
> > > +               /* Set namespaces to base_task */
> > > +               get_nsproxy(base_task->nsproxy);
> > > +               switch_task_namespaces(current, base_task->nsproxy);
> > 
> > This task will continue running in the current pid namespace, because
> > switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> > need to make setns+fork to switch a pid namespace.
> > 
> Yes.
> Now I known why I had not find this problem in my test,
> I output pids by a shell command, and the shell command is already got forked.
> 
> Seems need double fork(make code complex...) to do it.
> 
> > > +
> > > +               /* Set cgroup to base_task */
> > > +               current->flags &= ~PF_NO_SETAFFINITY;
> > > +               err = cgroup_attach_task_all(base_task, current);
> > > +               if (err < 0)
> > > +                       return err;
> > > +
> > > +               /* Set cred to base_task */
> > > +               base_cred = get_task_cred(base_task);
> > > +
> > 
> > I think you can use prepare_kernel_cred here
> > 
> prepare_kernel_cred() is called by caller of this function(call_usermodehelper_exec_async()),
> and this function is a hook, which can set additional cred property like:
> call_usermodehelper_exec_async()
> {
>     prepare_kernel_cred()
>     this_function()
>     {
>         custom_cred struct
>     } 
>     commit_creds(new);
> }
> 
> So we can only change cred's container here.
> 
> But seems we need rewrite this code when we move to above "double fork".
> 
> Before rewrite, could you give me some suggestion, if current code
> already can custom to dump information of "container internal" and "parent level container",
> is this meet your request?
> 
> Thanks
> Zhaolei
> 
> > > +               new->uid   = base_cred->uid;
> > > +               new->gid   = base_cred->gid;
> > > +               new->suid  = base_cred->suid;
> > > +               new->sgid  = base_cred->sgid;
> > > +               new->euid  = base_cred->euid;
> > > +               new->egid  = base_cred->egid;
> > > +               new->fsuid = base_cred->fsuid;
> > > +               new->fsgid = base_cred->fsgid;
> > > +
> > > +               new->securebits = base_cred->securebits;
> > > +
> > > +               new->cap_inheritable = base_cred->cap_inheritable;
> > > +               new->cap_permitted   = base_cred->cap_permitted;
> > > +               new->cap_effective   = base_cred->cap_effective;
> > > +               new->cap_bset        = base_cred->cap_bset;
> > > +               new->cap_ambient     = base_cred->cap_ambient;
> > > +
> > > +               security_cred_free(new);
> > > +#ifdef CONFIG_SECURITY
> > > +               new->security = NULL;
> > > +#endif
> > > +               err = security_prepare_creds(new, base_cred,
> > GFP_KERNEL);
> > > +               if (err < 0) {
> > > +                       put_cred(base_cred);
> > > +                       return err;
> > > +               }
> > > +
> > > +               free_uid(new->user);
> > > +               new->user = base_cred->user;
> > > +               get_uid(new->user);
> > > +
> > > +               put_user_ns(new->user_ns);
> > > +               new->user_ns = base_cred->user_ns;
> > > +               get_user_ns(new->user_ns);
> > > +
> > > +               put_group_info(new->group_info);
> > > +               new->group_info = base_cred->group_info;
> > > +               get_group_info(new->group_info);
> > > +
> > > +               put_cred(base_cred);
> > > +
> > > +               validate_creds(new);
> > > +       }
> > > +
> > > +       return 0;
> > >  }
> > >
> > >  void do_coredump(const siginfo_t *siginfo)
> > > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> > >
> > >         if (ispipe) {
> > >                 int dump_count;
> > > +               struct task_struct *vinit_task;
> > >                 char **helper_argv;
> > >                 struct subprocess_info *sub_info;
> > >
> > > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         goto fail_dropcount;
> > >                 }
> > >
> > > +               rcu_read_lock();
> > > +               vinit_task = find_task_by_vpid(1);
> > > +               rcu_read_unlock();
> > > +               if (!vinit_task) {
> > > +                       printk(KERN_WARNING "failed getting init task
> > info, skipping core dump\n");
> > > +                       goto fail_dropcount;
> > > +               }
> > > +
> > >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> > NULL);
> > >                 if (!helper_argv) {
> > >                         printk(KERN_WARNING "%s failed to allocate
> > memory\n",
> > > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         goto fail_dropcount;
> > >                 }
> > >
> > > +               get_task_struct(vinit_task);
> > > +
> > > +               cprm.base_task = vinit_task;
> > > +
> > >                 retval = -ENOMEM;
> > >                 sub_info = call_usermodehelper_setup(helper_argv[0],
> > >                                                 helper_argv,
> > NULL, GFP_KERNEL,
> > > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> > >                         retval = call_usermodehelper_exec(sub_info,
> > >
> > UMH_WAIT_EXEC);
> > >
> > > +               put_task_struct(vinit_task);
> > >                 argv_free(helper_argv);
> > >                 if (retval) {
> > >                         printk(KERN_INFO "Core dump to |%s pipe
> > failed\n",
> > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > > index 314b3ca..0c9a72c 100644
> > > --- a/include/linux/binfmts.h
> > > +++ b/include/linux/binfmts.h
> > > @@ -59,6 +59,7 @@ struct linux_binprm {
> > >
> > >  /* Function parameter for binfmt->coredump */
> > >  struct coredump_params {
> > > +       struct task_struct *base_task;
> > >         const siginfo_t *siginfo;
> > >         struct pt_regs *regs;
> > >         struct file *file;
> > > --
> > > 1.8.5.1
> > >
> > >
> > >
> > > _______________________________________________
> > > Containers mailing list
> > > Containers@lists.linux-foundation.org
> > > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
  2016-08-05 17:13               ` 'Andrei Vagin'
@ 2016-08-12  2:11                   ` Zhao Lei
  -1 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-12  2:11 UTC (permalink / raw)
  To: 'Andrei Vagin'
  Cc: 'Linux Containers', 'LKML', 'Eric W. Biederman'

Hi, Andrei Vagin

> -----Original Message-----
> From: 'Andrei Vagin' [mailto:avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> Sent: Saturday, August 06, 2016 1:14 AM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: 'LKML' <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; 'Linux Containers'
> <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>; 'Eric W. Biederman'
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> container
> 
> On Fri, Aug 05, 2016 at 03:52:25PM +0800, Zhao Lei wrote:
> > Hi, Andrei Vagin
> >
> > Thanks for your detailed review and suggestion.
> >
> > > -----Original Message-----
> > > From: Andrei Vagin [mailto:avagin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> > > Sent: Friday, August 05, 2016 2:32 PM
> > > To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > > Cc: LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; Linux Containers
> > > <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>; Eric W. Biederman
> > > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> > > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init
> for
> > > container
> > >
> > > On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> > > > Currently when we set core_pattern to a pipe, the pipe program is
> > > > forked by kthread running with root's permission, and write dumpfile
> > > > into host's filesystem.
> > > > Same thing happened for container, the dumper and dumpfile are also
> > > > in host(not in container).
> > > >
> > > > It have following program:
> > > > 1: Not consistent with file_type core_pattern
> > > >    When we set core_pattern to a file, the container will write dump
> > > >    into container's filesystem instead of host.
> > > > 2: Not safe for privileged container
> > > >    In a privileged container, user can destroy host system by following
> > > >    command:
> > > >    # # In a container
> > > >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> > > >    # make_dump
> > > >
> > > > This patch switch dumper program's environment to init task, so, for
> > > > container, dumper program have same environment with init task in
> > > > container, which make dumper program put in container's filesystem, and
> > > > write coredump into container's filesystem.
> > > > The dumper's permission is also limited into subset of container's init
> > > > process.
> > >
> > > Does it change the current behavior? A pid namespace may be used for
> > > sandboxes. For example, chrome uses it. In this case, we probably want
> > > to collect core dumps from a root pid namespace.
> > >
> > > If we are going to virtualize core_pattern relative to the pid
> > > namespace, maybe it's better to make it optional for a pid namespace.
> > > I mean it core_pattern is not set for the current pid namespace, we
> > > can check a parent pid namespace and so on. A helper will be executed
> > > in a pid namespace, where core_pattern is set.
> > >
> > Is it means, we don't want to always gather dump from the container ifself,
> > wnd we want to custom which level of the container can gather dump
> > information?
> >
> > If it is exact, current patch have similar function as above.
> >
> > If user had not set core_pattern in a namespace, the dump setting in
> > the parent namespace will be used.
> > For example,
> > HOST        change core_pattern
> > Container1  no change
> > Container2  change core_pattern
> > Container3  no change
> > Container4  no change
> >
> > If code dump happened in container 2, 3 and 4, the dumper's environment
> > is same as container2.
> > And if dump happened in container 1 and HOST, the dumper's environment
> > will be same as HOST.
> >
> > So,
> > 1: if a container want to make the custom core dump,
> >   it can set core_pattern in container, and the gathered information
> >   will be limited in this container.
> > 2: If the host(or some top level container) want to gather dump information
> >   more than a container internal, the host(or some top level container)
> >   can set core_pattern, and not change core_pattern in the container,
> >   then the pipe program will be run in same environment as the host,
> >   and gather information from host's env.
> 
> The behavior which you described is the same with what I'm thinking
> about.
> 
> But I don't understand how you implement it in a code.
> I see that you call find_task_by_vpid(1) to get a base task, so the base
> that is always the init task of the current pid namespace. I think we
> need to get the init task from a pid namespace where core_pattern is
> set, don't we?
> 
> pidns1 task1 (core_pattern is set)
> \_pidns2 task2 (core_pattern isn't set)
>  \_pidns2 ...
>   \_pidns2 taskX (crashed and call do_coredump())
> 
> find_task_by_vpid(1) is called from do_coredump() and will return task2,
> but if I understand you correctly, the base_task has to be task1 in this
> case.
> 
find_task_by_vpid(1) is changed in [PATCH 2/2].

It changed core_pattern to per-ns instead of global one, and save pid_ns of
the task who changed core_pattern, and use above pid_ns for find_task_by_vpid().

In above pid tree, the pidns2 taskX crashed, then get core_pattern and pid_ns from
pidns1's struct, so we can run pipe_program in pidns1.

> >
> > > After reading these patches, I think it may be a good idea to add one
> > > more mode to handle core files, when a core file is sent to an
> > > existing process instead of executing a usermode helpers. This mode
> > > will cover of use-cases where the pipe mode is used now. And it looks
> > > more secure, because in this case we control namespaces and credential
> > > for the abort daemon (a process which handles core files).u
> > >
> > > How it can be done. An abort daemon creates an abstract listening unix
> > > socket and writes its name into /proc/sys/kernel/core_pattern. The
> > > kernel saves this name and the network namespace. Then when any
> > > process is crashed, the kernel creates a new unix socket and connect
> > > it to the socket of the abort daemon and streams a core file into this
> > > socket.
> > >
> > Good idea.
> >
> > And do it means if we want to custom core_pattern for each container,
> > we can create more than one abort daemon for it?
> 
> Yes.
> 
> > If the abort daemon is running in the host, the dump file will be write
> > into host's env, and if abort daemon is running in container, the dump
> > file will be write into container's fs.
> 
> Yes. An abort daemon handles core files in its namespaces. The abort
> daemon knows a pid of a crashed process, so it has access to namespaces of
> the crashed task via /proc/pid/ns/* and can work with them. I want to say
> that this daemon can decided in which context to handle a core file.
> 
> I don't like the idea to play with namespaces, credentials, an security
> context for user helpers.
> 
Thanks for your detained explanation, I think it should be a useful function
for core dump.

Thanks
Zhaolei

> >
> > > >
> > > > Suggested-by: Eric W. Biederman <ebieyderm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> > > > Suggested-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> > > >
> > > > Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> > > > ---
> > > >  fs/coredump.c           | 87
> > > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  include/linux/binfmts.h |  1 +
> > > >  2 files changed, 87 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > > index 281b768..8511267 100644
> > > > --- a/fs/coredump.c
> > > > +++ b/fs/coredump.c
> > > > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct
> subprocess_info
> > > *info, struct cred *new)
> > > >  {
> > > >         struct file *files[2];
> > > >         struct coredump_params *cp = (struct coredump_params
> > > *)info->data;
> > > > +       struct task_struct *base_task;
> > > > +
> > > >         int err = create_pipe_files(files, 0);
> > > >         if (err)
> > > >                 return err;
> > > > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct
> subprocess_info
> > > *info, struct cred *new)
> > > >
> > > >         err = replace_fd(0, files[0], 0);
> > > >         fput(files[0]);
> > > > +       if (err)
> > > > +               return err;
> > > > +
> > > >         /* and disallow core files too */
> > > >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> > > >
> > > > -       return err;
> > > > +       base_task = cp->base_task;
> > > > +       if (base_task) {
> > > > +               const struct cred *base_cred;
> > > > +
> > > > +               /* Set fs_root to base_task */
> > > > +               spin_lock(&base_task->fs->lock);
> > > > +               set_fs_root(current->fs, &base_task->fs->root);
> > > > +               spin_unlock(&base_task->fs->lock);
> > > > +
> > > > +               /* Set namespaces to base_task */
> > > > +               get_nsproxy(base_task->nsproxy);
> > > > +               switch_task_namespaces(current,
> base_task->nsproxy);
> > >
> > > This task will continue running in the current pid namespace, because
> > > switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> > > need to make setns+fork to switch a pid namespace.
> > >
> > Yes.
> > Now I known why I had not find this problem in my test,
> > I output pids by a shell command, and the shell command is already got
> forked.
> >
> > Seems need double fork(make code complex...) to do it.
> >
> > > > +
> > > > +               /* Set cgroup to base_task */
> > > > +               current->flags &= ~PF_NO_SETAFFINITY;
> > > > +               err = cgroup_attach_task_all(base_task, current);
> > > > +               if (err < 0)
> > > > +                       return err;
> > > > +
> > > > +               /* Set cred to base_task */
> > > > +               base_cred = get_task_cred(base_task);
> > > > +
> > >
> > > I think you can use prepare_kernel_cred here
> > >
> > prepare_kernel_cred() is called by caller of this
> function(call_usermodehelper_exec_async()),
> > and this function is a hook, which can set additional cred property like:
> > call_usermodehelper_exec_async()
> > {
> >     prepare_kernel_cred()
> >     this_function()
> >     {
> >         custom_cred struct
> >     }
> >     commit_creds(new);
> > }
> >
> > So we can only change cred's container here.
> >
> > But seems we need rewrite this code when we move to above "double fork".
> >
> > Before rewrite, could you give me some suggestion, if current code
> > already can custom to dump information of "container internal" and "parent
> level container",
> > is this meet your request?
> >
> > Thanks
> > Zhaolei
> >
> > > > +               new->uid   = base_cred->uid;
> > > > +               new->gid   = base_cred->gid;
> > > > +               new->suid  = base_cred->suid;
> > > > +               new->sgid  = base_cred->sgid;
> > > > +               new->euid  = base_cred->euid;
> > > > +               new->egid  = base_cred->egid;
> > > > +               new->fsuid = base_cred->fsuid;
> > > > +               new->fsgid = base_cred->fsgid;
> > > > +
> > > > +               new->securebits = base_cred->securebits;
> > > > +
> > > > +               new->cap_inheritable = base_cred->cap_inheritable;
> > > > +               new->cap_permitted   =
> base_cred->cap_permitted;
> > > > +               new->cap_effective   = base_cred->cap_effective;
> > > > +               new->cap_bset        = base_cred->cap_bset;
> > > > +               new->cap_ambient     = base_cred->cap_ambient;
> > > > +
> > > > +               security_cred_free(new);
> > > > +#ifdef CONFIG_SECURITY
> > > > +               new->security = NULL;
> > > > +#endif
> > > > +               err = security_prepare_creds(new, base_cred,
> > > GFP_KERNEL);
> > > > +               if (err < 0) {
> > > > +                       put_cred(base_cred);
> > > > +                       return err;
> > > > +               }
> > > > +
> > > > +               free_uid(new->user);
> > > > +               new->user = base_cred->user;
> > > > +               get_uid(new->user);
> > > > +
> > > > +               put_user_ns(new->user_ns);
> > > > +               new->user_ns = base_cred->user_ns;
> > > > +               get_user_ns(new->user_ns);
> > > > +
> > > > +               put_group_info(new->group_info);
> > > > +               new->group_info = base_cred->group_info;
> > > > +               get_group_info(new->group_info);
> > > > +
> > > > +               put_cred(base_cred);
> > > > +
> > > > +               validate_creds(new);
> > > > +       }
> > > > +
> > > > +       return 0;
> > > >  }
> > > >
> > > >  void do_coredump(const siginfo_t *siginfo)
> > > > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> > > >
> > > >         if (ispipe) {
> > > >                 int dump_count;
> > > > +               struct task_struct *vinit_task;
> > > >                 char **helper_argv;
> > > >                 struct subprocess_info *sub_info;
> > > >
> > > > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         goto fail_dropcount;
> > > >                 }
> > > >
> > > > +               rcu_read_lock();
> > > > +               vinit_task = find_task_by_vpid(1);
> > > > +               rcu_read_unlock();
> > > > +               if (!vinit_task) {
> > > > +                       printk(KERN_WARNING "failed getting init
> task
> > > info, skipping core dump\n");
> > > > +                       goto fail_dropcount;
> > > > +               }
> > > > +
> > > >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> > > NULL);
> > > >                 if (!helper_argv) {
> > > >                         printk(KERN_WARNING "%s failed to
> allocate
> > > memory\n",
> > > > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         goto fail_dropcount;
> > > >                 }
> > > >
> > > > +               get_task_struct(vinit_task);
> > > > +
> > > > +               cprm.base_task = vinit_task;
> > > > +
> > > >                 retval = -ENOMEM;
> > > >                 sub_info =
> call_usermodehelper_setup(helper_argv[0],
> > > >                                                 helper_argv,
> > > NULL, GFP_KERNEL,
> > > > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         retval =
> call_usermodehelper_exec(sub_info,
> > > >
> > > UMH_WAIT_EXEC);
> > > >
> > > > +               put_task_struct(vinit_task);
> > > >                 argv_free(helper_argv);
> > > >                 if (retval) {
> > > >                         printk(KERN_INFO "Core dump to |%s pipe
> > > failed\n",
> > > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > > > index 314b3ca..0c9a72c 100644
> > > > --- a/include/linux/binfmts.h
> > > > +++ b/include/linux/binfmts.h
> > > > @@ -59,6 +59,7 @@ struct linux_binprm {
> > > >
> > > >  /* Function parameter for binfmt->coredump */
> > > >  struct coredump_params {
> > > > +       struct task_struct *base_task;
> > > >         const siginfo_t *siginfo;
> > > >         struct pt_regs *regs;
> > > >         struct file *file;
> > > > --
> > > > 1.8.5.1
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Containers mailing list
> > > > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > > > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > >
> >
> >
> >
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container
@ 2016-08-12  2:11                   ` Zhao Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Zhao Lei @ 2016-08-12  2:11 UTC (permalink / raw)
  To: 'Andrei Vagin'
  Cc: 'LKML', 'Linux Containers', 'Eric W. Biederman'

Hi, Andrei Vagin

> -----Original Message-----
> From: 'Andrei Vagin' [mailto:avagin@gmail.com]
> Sent: Saturday, August 06, 2016 1:14 AM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: 'LKML' <linux-kernel@vger.kernel.org>; 'Linux Containers'
> <containers@lists.linux-foundation.org>; 'Eric W. Biederman'
> <ebiederm@xmission.com>
> Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for
> container
> 
> On Fri, Aug 05, 2016 at 03:52:25PM +0800, Zhao Lei wrote:
> > Hi, Andrei Vagin
> >
> > Thanks for your detailed review and suggestion.
> >
> > > -----Original Message-----
> > > From: Andrei Vagin [mailto:avagin@gmail.com]
> > > Sent: Friday, August 05, 2016 2:32 PM
> > > To: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > Cc: LKML <linux-kernel@vger.kernel.org>; Linux Containers
> > > <containers@lists.linux-foundation.org>; Eric W. Biederman
> > > <ebiederm@xmission.com>
> > > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init
> for
> > > container
> > >
> > > On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > > > Currently when we set core_pattern to a pipe, the pipe program is
> > > > forked by kthread running with root's permission, and write dumpfile
> > > > into host's filesystem.
> > > > Same thing happened for container, the dumper and dumpfile are also
> > > > in host(not in container).
> > > >
> > > > It have following program:
> > > > 1: Not consistent with file_type core_pattern
> > > >    When we set core_pattern to a file, the container will write dump
> > > >    into container's filesystem instead of host.
> > > > 2: Not safe for privileged container
> > > >    In a privileged container, user can destroy host system by following
> > > >    command:
> > > >    # # In a container
> > > >    # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> > > >    # make_dump
> > > >
> > > > This patch switch dumper program's environment to init task, so, for
> > > > container, dumper program have same environment with init task in
> > > > container, which make dumper program put in container's filesystem, and
> > > > write coredump into container's filesystem.
> > > > The dumper's permission is also limited into subset of container's init
> > > > process.
> > >
> > > Does it change the current behavior? A pid namespace may be used for
> > > sandboxes. For example, chrome uses it. In this case, we probably want
> > > to collect core dumps from a root pid namespace.
> > >
> > > If we are going to virtualize core_pattern relative to the pid
> > > namespace, maybe it's better to make it optional for a pid namespace.
> > > I mean it core_pattern is not set for the current pid namespace, we
> > > can check a parent pid namespace and so on. A helper will be executed
> > > in a pid namespace, where core_pattern is set.
> > >
> > Is it means, we don't want to always gather dump from the container ifself,
> > wnd we want to custom which level of the container can gather dump
> > information?
> >
> > If it is exact, current patch have similar function as above.
> >
> > If user had not set core_pattern in a namespace, the dump setting in
> > the parent namespace will be used.
> > For example,
> > HOST        change core_pattern
> > Container1  no change
> > Container2  change core_pattern
> > Container3  no change
> > Container4  no change
> >
> > If code dump happened in container 2, 3 and 4, the dumper's environment
> > is same as container2.
> > And if dump happened in container 1 and HOST, the dumper's environment
> > will be same as HOST.
> >
> > So,
> > 1: if a container want to make the custom core dump,
> >   it can set core_pattern in container, and the gathered information
> >   will be limited in this container.
> > 2: If the host(or some top level container) want to gather dump information
> >   more than a container internal, the host(or some top level container)
> >   can set core_pattern, and not change core_pattern in the container,
> >   then the pipe program will be run in same environment as the host,
> >   and gather information from host's env.
> 
> The behavior which you described is the same with what I'm thinking
> about.
> 
> But I don't understand how you implement it in a code.
> I see that you call find_task_by_vpid(1) to get a base task, so the base
> that is always the init task of the current pid namespace. I think we
> need to get the init task from a pid namespace where core_pattern is
> set, don't we?
> 
> pidns1 task1 (core_pattern is set)
> \_pidns2 task2 (core_pattern isn't set)
>  \_pidns2 ...
>   \_pidns2 taskX (crashed and call do_coredump())
> 
> find_task_by_vpid(1) is called from do_coredump() and will return task2,
> but if I understand you correctly, the base_task has to be task1 in this
> case.
> 
find_task_by_vpid(1) is changed in [PATCH 2/2].

It changed core_pattern to per-ns instead of global one, and save pid_ns of
the task who changed core_pattern, and use above pid_ns for find_task_by_vpid().

In above pid tree, the pidns2 taskX crashed, then get core_pattern and pid_ns from
pidns1's struct, so we can run pipe_program in pidns1.

> >
> > > After reading these patches, I think it may be a good idea to add one
> > > more mode to handle core files, when a core file is sent to an
> > > existing process instead of executing a usermode helpers. This mode
> > > will cover of use-cases where the pipe mode is used now. And it looks
> > > more secure, because in this case we control namespaces and credential
> > > for the abort daemon (a process which handles core files).u
> > >
> > > How it can be done. An abort daemon creates an abstract listening unix
> > > socket and writes its name into /proc/sys/kernel/core_pattern. The
> > > kernel saves this name and the network namespace. Then when any
> > > process is crashed, the kernel creates a new unix socket and connect
> > > it to the socket of the abort daemon and streams a core file into this
> > > socket.
> > >
> > Good idea.
> >
> > And do it means if we want to custom core_pattern for each container,
> > we can create more than one abort daemon for it?
> 
> Yes.
> 
> > If the abort daemon is running in the host, the dump file will be write
> > into host's env, and if abort daemon is running in container, the dump
> > file will be write into container's fs.
> 
> Yes. An abort daemon handles core files in its namespaces. The abort
> daemon knows a pid of a crashed process, so it has access to namespaces of
> the crashed task via /proc/pid/ns/* and can work with them. I want to say
> that this daemon can decided in which context to handle a core file.
> 
> I don't like the idea to play with namespaces, credentials, an security
> context for user helpers.
> 
Thanks for your detained explanation, I think it should be a useful function
for core dump.

Thanks
Zhaolei

> >
> > > >
> > > > Suggested-by: Eric W. Biederman <ebieyderm@xmission.com>
> > > > Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > >
> > > > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> > > > ---
> > > >  fs/coredump.c           | 87
> > > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  include/linux/binfmts.h |  1 +
> > > >  2 files changed, 87 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/coredump.c b/fs/coredump.c
> > > > index 281b768..8511267 100644
> > > > --- a/fs/coredump.c
> > > > +++ b/fs/coredump.c
> > > > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct
> subprocess_info
> > > *info, struct cred *new)
> > > >  {
> > > >         struct file *files[2];
> > > >         struct coredump_params *cp = (struct coredump_params
> > > *)info->data;
> > > > +       struct task_struct *base_task;
> > > > +
> > > >         int err = create_pipe_files(files, 0);
> > > >         if (err)
> > > >                 return err;
> > > > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct
> subprocess_info
> > > *info, struct cred *new)
> > > >
> > > >         err = replace_fd(0, files[0], 0);
> > > >         fput(files[0]);
> > > > +       if (err)
> > > > +               return err;
> > > > +
> > > >         /* and disallow core files too */
> > > >         current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
> > > >
> > > > -       return err;
> > > > +       base_task = cp->base_task;
> > > > +       if (base_task) {
> > > > +               const struct cred *base_cred;
> > > > +
> > > > +               /* Set fs_root to base_task */
> > > > +               spin_lock(&base_task->fs->lock);
> > > > +               set_fs_root(current->fs, &base_task->fs->root);
> > > > +               spin_unlock(&base_task->fs->lock);
> > > > +
> > > > +               /* Set namespaces to base_task */
> > > > +               get_nsproxy(base_task->nsproxy);
> > > > +               switch_task_namespaces(current,
> base_task->nsproxy);
> > >
> > > This task will continue running in the current pid namespace, because
> > > switch_task_namespaces doesn't change a pid for the task. Ussualy, we
> > > need to make setns+fork to switch a pid namespace.
> > >
> > Yes.
> > Now I known why I had not find this problem in my test,
> > I output pids by a shell command, and the shell command is already got
> forked.
> >
> > Seems need double fork(make code complex...) to do it.
> >
> > > > +
> > > > +               /* Set cgroup to base_task */
> > > > +               current->flags &= ~PF_NO_SETAFFINITY;
> > > > +               err = cgroup_attach_task_all(base_task, current);
> > > > +               if (err < 0)
> > > > +                       return err;
> > > > +
> > > > +               /* Set cred to base_task */
> > > > +               base_cred = get_task_cred(base_task);
> > > > +
> > >
> > > I think you can use prepare_kernel_cred here
> > >
> > prepare_kernel_cred() is called by caller of this
> function(call_usermodehelper_exec_async()),
> > and this function is a hook, which can set additional cred property like:
> > call_usermodehelper_exec_async()
> > {
> >     prepare_kernel_cred()
> >     this_function()
> >     {
> >         custom_cred struct
> >     }
> >     commit_creds(new);
> > }
> >
> > So we can only change cred's container here.
> >
> > But seems we need rewrite this code when we move to above "double fork".
> >
> > Before rewrite, could you give me some suggestion, if current code
> > already can custom to dump information of "container internal" and "parent
> level container",
> > is this meet your request?
> >
> > Thanks
> > Zhaolei
> >
> > > > +               new->uid   = base_cred->uid;
> > > > +               new->gid   = base_cred->gid;
> > > > +               new->suid  = base_cred->suid;
> > > > +               new->sgid  = base_cred->sgid;
> > > > +               new->euid  = base_cred->euid;
> > > > +               new->egid  = base_cred->egid;
> > > > +               new->fsuid = base_cred->fsuid;
> > > > +               new->fsgid = base_cred->fsgid;
> > > > +
> > > > +               new->securebits = base_cred->securebits;
> > > > +
> > > > +               new->cap_inheritable = base_cred->cap_inheritable;
> > > > +               new->cap_permitted   =
> base_cred->cap_permitted;
> > > > +               new->cap_effective   = base_cred->cap_effective;
> > > > +               new->cap_bset        = base_cred->cap_bset;
> > > > +               new->cap_ambient     = base_cred->cap_ambient;
> > > > +
> > > > +               security_cred_free(new);
> > > > +#ifdef CONFIG_SECURITY
> > > > +               new->security = NULL;
> > > > +#endif
> > > > +               err = security_prepare_creds(new, base_cred,
> > > GFP_KERNEL);
> > > > +               if (err < 0) {
> > > > +                       put_cred(base_cred);
> > > > +                       return err;
> > > > +               }
> > > > +
> > > > +               free_uid(new->user);
> > > > +               new->user = base_cred->user;
> > > > +               get_uid(new->user);
> > > > +
> > > > +               put_user_ns(new->user_ns);
> > > > +               new->user_ns = base_cred->user_ns;
> > > > +               get_user_ns(new->user_ns);
> > > > +
> > > > +               put_group_info(new->group_info);
> > > > +               new->group_info = base_cred->group_info;
> > > > +               get_group_info(new->group_info);
> > > > +
> > > > +               put_cred(base_cred);
> > > > +
> > > > +               validate_creds(new);
> > > > +       }
> > > > +
> > > > +       return 0;
> > > >  }
> > > >
> > > >  void do_coredump(const siginfo_t *siginfo)
> > > > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
> > > >
> > > >         if (ispipe) {
> > > >                 int dump_count;
> > > > +               struct task_struct *vinit_task;
> > > >                 char **helper_argv;
> > > >                 struct subprocess_info *sub_info;
> > > >
> > > > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         goto fail_dropcount;
> > > >                 }
> > > >
> > > > +               rcu_read_lock();
> > > > +               vinit_task = find_task_by_vpid(1);
> > > > +               rcu_read_unlock();
> > > > +               if (!vinit_task) {
> > > > +                       printk(KERN_WARNING "failed getting init
> task
> > > info, skipping core dump\n");
> > > > +                       goto fail_dropcount;
> > > > +               }
> > > > +
> > > >                 helper_argv = argv_split(GFP_KERNEL, cn.corename,
> > > NULL);
> > > >                 if (!helper_argv) {
> > > >                         printk(KERN_WARNING "%s failed to
> allocate
> > > memory\n",
> > > > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         goto fail_dropcount;
> > > >                 }
> > > >
> > > > +               get_task_struct(vinit_task);
> > > > +
> > > > +               cprm.base_task = vinit_task;
> > > > +
> > > >                 retval = -ENOMEM;
> > > >                 sub_info =
> call_usermodehelper_setup(helper_argv[0],
> > > >                                                 helper_argv,
> > > NULL, GFP_KERNEL,
> > > > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> > > >                         retval =
> call_usermodehelper_exec(sub_info,
> > > >
> > > UMH_WAIT_EXEC);
> > > >
> > > > +               put_task_struct(vinit_task);
> > > >                 argv_free(helper_argv);
> > > >                 if (retval) {
> > > >                         printk(KERN_INFO "Core dump to |%s pipe
> > > failed\n",
> > > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > > > index 314b3ca..0c9a72c 100644
> > > > --- a/include/linux/binfmts.h
> > > > +++ b/include/linux/binfmts.h
> > > > @@ -59,6 +59,7 @@ struct linux_binprm {
> > > >
> > > >  /* Function parameter for binfmt->coredump */
> > > >  struct coredump_params {
> > > > +       struct task_struct *base_task;
> > > >         const siginfo_t *siginfo;
> > > >         struct pt_regs *regs;
> > > >         struct file *file;
> > > > --
> > > > 1.8.5.1
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Containers mailing list
> > > > Containers@lists.linux-foundation.org
> > > > https://lists.linuxfoundation.org/mailman/listinfo/containers
> > >
> >
> >
> >
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-08-12  2:11 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-02  9:08 [PATCH v2 0/2] Make core_pattern support namespace Zhao Lei
2016-08-02  9:08 ` Zhao Lei
     [not found] ` <cover.1470128571.git.zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2016-08-02  9:08   ` [PATCH v2 1/2] Limit dump_pipe program's permission to init for container Zhao Lei
2016-08-02  9:08     ` Zhao Lei
     [not found]     ` <c7d8e45aa76a414c73082d22d325cd5a13979773.1470128572.git.zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2016-08-05  6:32       ` Andrei Vagin
2016-08-05  6:32         ` Andrei Vagin
     [not found]         ` <CANaxB-yz2AbjQYLLfXF03A-H=b5FV6+Dc8egorFFydbtZO-9Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-05  7:52           ` Zhao Lei
2016-08-05  7:52             ` Zhao Lei
2016-08-05 17:13             ` 'Andrei Vagin'
2016-08-05 17:13               ` 'Andrei Vagin'
     [not found]               ` <20160805171339.GA19853-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-08-12  2:11                 ` Zhao Lei
2016-08-12  2:11                   ` Zhao Lei
2016-08-02  9:08   ` [PATCH v2 2/2] Make core_pattern support namespace Zhao Lei
2016-08-02  9:08     ` Zhao Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.