All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] Make core_pattern support namespace
@ 2016-03-18 12:48 ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

We discussed patch titled:
 [PATCH] Make core_pattern support namespace
before.

Above patch can solve half problem of custom core_dump pattern
in container, but there are also another problem that limit
custom core_pattern in container, it is the pipe-type core_pattern
will write core dump into host's filesystem.
(See discussion of that patch for detail)

Now we can solve the second problem by [PATCH 2/3], I send
the origional patch with it.

Changelog v1->v2:
1: Fix compile fail in xtensa which is reported by
   kbuild test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2: Use a wapper to avoid calling _do_fork() outside kernel/fork.c,
   suggested by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Zhao Lei (3):
  Make _do_fork support return to caller's code
  Run dump pipe in container's namespace
  Make core_pattern support namespace

 arch/alpha/kernel/process.c         |  4 +-
 arch/arc/kernel/process.c           |  4 +-
 arch/arm/kernel/process.c           |  4 +-
 arch/arm64/kernel/process.c         |  5 ++-
 arch/avr32/kernel/process.c         |  5 ++-
 arch/blackfin/kernel/process.c      |  5 ++-
 arch/c6x/kernel/process.c           |  5 ++-
 arch/cris/arch-v10/kernel/process.c |  4 +-
 arch/cris/arch-v32/kernel/process.c |  4 +-
 arch/frv/kernel/process.c           |  4 +-
 arch/h8300/kernel/process.c         |  4 +-
 arch/hexagon/kernel/process.c       |  4 +-
 arch/ia64/kernel/process.c          |  4 +-
 arch/m32r/kernel/process.c          |  4 +-
 arch/m68k/kernel/process.c          |  4 +-
 arch/metag/kernel/process.c         |  5 ++-
 arch/microblaze/kernel/process.c    |  4 +-
 arch/mips/kernel/process.c          |  4 +-
 arch/mn10300/kernel/process.c       |  4 +-
 arch/nios2/kernel/process.c         |  5 ++-
 arch/openrisc/kernel/process.c      |  4 +-
 arch/parisc/kernel/process.c        |  5 ++-
 arch/powerpc/kernel/process.c       |  5 ++-
 arch/s390/kernel/process.c          |  4 +-
 arch/score/kernel/process.c         |  4 +-
 arch/sh/kernel/process_32.c         |  4 +-
 arch/sh/kernel/process_64.c         |  4 +-
 arch/sparc/kernel/process_32.c      |  4 +-
 arch/sparc/kernel/process_64.c      |  4 +-
 arch/tile/kernel/process.c          |  4 +-
 arch/um/kernel/process.c            | 22 ++++++-----
 arch/unicore32/kernel/process.c     |  4 +-
 arch/x86/kernel/process_32.c        |  5 ++-
 arch/x86/kernel/process_64.c        |  5 ++-
 arch/xtensa/kernel/process.c        |  5 ++-
 fs/coredump.c                       | 79 ++++++++++++++++++++++---------------
 include/linux/pid_namespace.h       |  2 +
 include/linux/sched.h               | 12 +++---
 kernel/fork.c                       | 30 +++++++++-----
 kernel/pid.c                        |  1 +
 kernel/pid_namespace.c              |  3 ++
 kernel/sysctl.c                     | 22 ++++++++---
 42 files changed, 187 insertions(+), 131 deletions(-)

-- 
1.8.5.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 0/3] Make core_pattern support namespace
@ 2016-03-18 12:48 ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Eric W. Biederman, Mateusz Guzik, Zhao Lei

We discussed patch titled:
 [PATCH] Make core_pattern support namespace
before.

Above patch can solve half problem of custom core_dump pattern
in container, but there are also another problem that limit
custom core_pattern in container, it is the pipe-type core_pattern
will write core dump into host's filesystem.
(See discussion of that patch for detail)

Now we can solve the second problem by [PATCH 2/3], I send
the origional patch with it.

Changelog v1->v2:
1: Fix compile fail in xtensa which is reported by
   kbuild test robot <lkp@intel.com>
2: Use a wapper to avoid calling _do_fork() outside kernel/fork.c,
   suggested by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Zhao Lei (3):
  Make _do_fork support return to caller's code
  Run dump pipe in container's namespace
  Make core_pattern support namespace

 arch/alpha/kernel/process.c         |  4 +-
 arch/arc/kernel/process.c           |  4 +-
 arch/arm/kernel/process.c           |  4 +-
 arch/arm64/kernel/process.c         |  5 ++-
 arch/avr32/kernel/process.c         |  5 ++-
 arch/blackfin/kernel/process.c      |  5 ++-
 arch/c6x/kernel/process.c           |  5 ++-
 arch/cris/arch-v10/kernel/process.c |  4 +-
 arch/cris/arch-v32/kernel/process.c |  4 +-
 arch/frv/kernel/process.c           |  4 +-
 arch/h8300/kernel/process.c         |  4 +-
 arch/hexagon/kernel/process.c       |  4 +-
 arch/ia64/kernel/process.c          |  4 +-
 arch/m32r/kernel/process.c          |  4 +-
 arch/m68k/kernel/process.c          |  4 +-
 arch/metag/kernel/process.c         |  5 ++-
 arch/microblaze/kernel/process.c    |  4 +-
 arch/mips/kernel/process.c          |  4 +-
 arch/mn10300/kernel/process.c       |  4 +-
 arch/nios2/kernel/process.c         |  5 ++-
 arch/openrisc/kernel/process.c      |  4 +-
 arch/parisc/kernel/process.c        |  5 ++-
 arch/powerpc/kernel/process.c       |  5 ++-
 arch/s390/kernel/process.c          |  4 +-
 arch/score/kernel/process.c         |  4 +-
 arch/sh/kernel/process_32.c         |  4 +-
 arch/sh/kernel/process_64.c         |  4 +-
 arch/sparc/kernel/process_32.c      |  4 +-
 arch/sparc/kernel/process_64.c      |  4 +-
 arch/tile/kernel/process.c          |  4 +-
 arch/um/kernel/process.c            | 22 ++++++-----
 arch/unicore32/kernel/process.c     |  4 +-
 arch/x86/kernel/process_32.c        |  5 ++-
 arch/x86/kernel/process_64.c        |  5 ++-
 arch/xtensa/kernel/process.c        |  5 ++-
 fs/coredump.c                       | 79 ++++++++++++++++++++++---------------
 include/linux/pid_namespace.h       |  2 +
 include/linux/sched.h               | 12 +++---
 kernel/fork.c                       | 30 +++++++++-----
 kernel/pid.c                        |  1 +
 kernel/pid_namespace.c              |  3 ++
 kernel/sysctl.c                     | 22 ++++++++---
 42 files changed, 187 insertions(+), 131 deletions(-)

-- 
1.8.5.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/3] Make _do_fork support return to caller's code
  2016-03-18 12:48 ` Zhao Lei
@ 2016-03-18 12:48     ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Currently, _do_fork() have following action:
1: Called by user task
   New task return to user space code directly, bypass remain
   kernel space code.
2: Called by kthread
   New task run into callback function in kernel space specified
   by _do_fork() caller.

In somecase, we need to fork a new task in user process's context,
it is used for next patch to fix core dump's security problem in
container.

This patch add a argument named return_to_kernel to _do_fork(),
to make new task always return to kernel_space with this argument.

Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 arch/alpha/kernel/process.c         |  4 ++--
 arch/arc/kernel/process.c           |  4 ++--
 arch/arm/kernel/process.c           |  4 ++--
 arch/arm64/kernel/process.c         |  5 +++--
 arch/avr32/kernel/process.c         |  5 +++--
 arch/blackfin/kernel/process.c      |  5 +++--
 arch/c6x/kernel/process.c           |  5 +++--
 arch/cris/arch-v10/kernel/process.c |  4 ++--
 arch/cris/arch-v32/kernel/process.c |  4 ++--
 arch/frv/kernel/process.c           |  4 ++--
 arch/h8300/kernel/process.c         |  4 ++--
 arch/hexagon/kernel/process.c       |  4 ++--
 arch/ia64/kernel/process.c          |  4 ++--
 arch/m32r/kernel/process.c          |  4 ++--
 arch/m68k/kernel/process.c          |  4 ++--
 arch/metag/kernel/process.c         |  5 +++--
 arch/microblaze/kernel/process.c    |  4 ++--
 arch/mips/kernel/process.c          |  4 ++--
 arch/mn10300/kernel/process.c       |  4 ++--
 arch/nios2/kernel/process.c         |  5 +++--
 arch/openrisc/kernel/process.c      |  4 ++--
 arch/parisc/kernel/process.c        |  5 +++--
 arch/powerpc/kernel/process.c       |  5 +++--
 arch/s390/kernel/process.c          |  4 ++--
 arch/score/kernel/process.c         |  4 ++--
 arch/sh/kernel/process_32.c         |  4 ++--
 arch/sh/kernel/process_64.c         |  4 ++--
 arch/sparc/kernel/process_32.c      |  4 ++--
 arch/sparc/kernel/process_64.c      |  4 ++--
 arch/tile/kernel/process.c          |  4 ++--
 arch/um/kernel/process.c            | 22 ++++++++++++----------
 arch/unicore32/kernel/process.c     |  4 ++--
 arch/x86/kernel/process_32.c        |  5 +++--
 arch/x86/kernel/process_64.c        |  5 +++--
 arch/xtensa/kernel/process.c        |  5 +++--
 include/linux/sched.h               | 11 ++++++-----
 kernel/fork.c                       | 24 ++++++++++++++----------
 37 files changed, 111 insertions(+), 93 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 84d1326..8c8b497 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -241,7 +241,7 @@ release_thread(struct task_struct *dead_task)
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
 	    unsigned long kthread_arg,
-	    struct task_struct *p)
+	    struct task_struct *p, int return_to_kernel)
 {
 	extern void ret_from_fork(void);
 	extern void ret_from_kernel_thread(void);
@@ -255,7 +255,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index a3f750e..252befc 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -88,7 +88,7 @@ asmlinkage void ret_from_fork(void);
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long kthread_arg,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *c_regs;        /* child's pt_regs */
 	unsigned long *childksp;       /* to unwind out of __switch_to() */
@@ -115,7 +115,7 @@ int copy_thread(unsigned long clone_flags,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 4adfb46..f1acaf0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -222,7 +222,7 @@ asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
 int
 copy_thread(unsigned long clone_flags, unsigned long stack_start,
-	    unsigned long stk_sz, struct task_struct *p)
+	    unsigned long stk_sz, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *thread = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
@@ -239,7 +239,7 @@ copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & PF_KTHREAD)) && !return_to_kernel) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 88d742b..62a38eb 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -245,13 +245,14 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-		unsigned long stk_sz, struct task_struct *p)
+		unsigned long stk_sz, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & PF_KTHREAD)) && !return_to_kernel) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 42a53e74..4ac993f 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -281,11 +281,12 @@ asmlinkage void syscall_return(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long arg,
-		struct task_struct *p)
+		struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.cpu_context.r0 = arg;
 		p->thread.cpu_context.r1 = usp; /* fn */
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 4aa5545..23d8655 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -112,14 +112,15 @@ asmlinkage int bfin_clone(unsigned long clone_flags, unsigned long newsp)
 int
 copy_thread(unsigned long clone_flags,
 	    unsigned long usp, unsigned long topstk,
-	    struct task_struct *p)
+	    struct task_struct *p,
+	    int return_to_kernel)
 {
 	struct pt_regs *childregs;
 	unsigned long *v;
 
 	childregs = (struct pt_regs *) (task_stack_page(p) + THREAD_SIZE) - 1;
 	v = ((unsigned long *)childregs) - 2;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		v[0] = usp;
 		v[1] = topstk;
diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
index 3ae9f5a..31fa2ad 100644
--- a/arch/c6x/kernel/process.c
+++ b/arch/c6x/kernel/process.c
@@ -112,13 +112,14 @@ void start_thread(struct pt_regs *regs, unsigned int pc, unsigned long usp)
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long ustk_size,
-		struct task_struct *p)
+		struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
 	childregs = task_pt_regs(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* case of  __kernel_thread: we return to supervisor space */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->sp = (unsigned long)(childregs + 1);
diff --git a/arch/cris/arch-v10/kernel/process.c b/arch/cris/arch-v10/kernel/process.c
index 02b7834..d03a6bf 100644
--- a/arch/cris/arch-v10/kernel/process.c
+++ b/arch/cris/arch-v10/kernel/process.c
@@ -95,7 +95,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct switch_stack *swstack = ((struct switch_stack *)childregs) - 1;
@@ -104,7 +104,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	 * remember that the task_struct doubles as the kernel stack for the task
 	 */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(swstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		swstack->r1 = usp;
diff --git a/arch/cris/arch-v32/kernel/process.c b/arch/cris/arch-v32/kernel/process.c
index c7ce784..aa6b71f 100644
--- a/arch/cris/arch-v32/kernel/process.c
+++ b/arch/cris/arch-v32/kernel/process.c
@@ -103,7 +103,7 @@ extern asmlinkage void ret_from_kernel_thread(void);
 
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	unsigned long arg, struct task_struct *p)
+	unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct switch_stack *swstack = ((struct switch_stack *) childregs) - 1;
@@ -113,7 +113,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	 * fix it up. Note: the task_struct doubles as the kernel stack for the
 	 * task.
 	 */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(swstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		swstack->r1 = usp;
diff --git a/arch/frv/kernel/process.c b/arch/frv/kernel/process.c
index 5d40aeb77..5f7d8a8 100644
--- a/arch/frv/kernel/process.c
+++ b/arch/frv/kernel/process.c
@@ -127,7 +127,7 @@ inline unsigned long user_stack(const struct pt_regs *regs)
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long arg,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
@@ -144,7 +144,7 @@ int copy_thread(unsigned long clone_flags,
 	p->thread.lr	 = 0;
 	p->thread.frame0 = childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		childregs->gr9 = usp; /* function */
 		childregs->gr8 = arg;
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index dee4125..c6613bd 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -104,13 +104,13 @@ void flush_thread(void)
 
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long topstk,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index a9ebd47..920604b 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -69,7 +69,7 @@ unsigned long thread_saved_pc(struct task_struct *tsk)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct hexagon_switch_stack *ss;
@@ -91,7 +91,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index b515149..a5ff431 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -333,7 +333,7 @@ ia64_load_extra (struct task_struct *task)
 int
 copy_thread(unsigned long clone_flags,
 	     unsigned long user_stack_base, unsigned long user_stack_size,
-	     struct task_struct *p)
+	     struct task_struct *p, int return_to_kernel)
 {
 	extern char ia64_ret_from_clone;
 	struct switch_stack *child_stack, *stack;
@@ -375,7 +375,7 @@ copy_thread(unsigned long clone_flags,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c
index e69221d..aec7068 100644
--- a/arch/m32r/kernel/process.c
+++ b/arch/m32r/kernel/process.c
@@ -129,13 +129,13 @@ int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpu)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long spu,
-	unsigned long arg, struct task_struct *tsk)
+	unsigned long arg, struct task_struct *tsk, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(tsk);
 	extern void ret_from_fork(void);
 	extern void ret_from_kernel_thread(void);
 
-	if (unlikely(tsk->flags & PF_KTHREAD)) {
+	if (unlikely(tsk->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->psw = M32R_PSW_BIE;
 		childregs->r1 = spu;	/* fn */
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index c55ff71..2c3e3d3 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -130,7 +130,7 @@ asmlinkage int m68k_clone(struct pt_regs *regs)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		 unsigned long arg, struct task_struct *p)
+		 unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct fork_frame {
 		struct switch_stack sw;
@@ -148,7 +148,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	 */
 	p->thread.fs = get_fs().seg;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 7f54618..2e78cd4 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -178,7 +178,8 @@ void show_regs(struct pt_regs *regs)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long kthread_arg, struct task_struct *tsk)
+		unsigned long kthread_arg, struct task_struct *tsk,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(tsk);
 	void *kernel_context = ((void *) childregs +
@@ -195,7 +196,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 						     ret_from_fork,
 						     0, 0);
 
-	if (unlikely(tsk->flags & PF_KTHREAD)) {
+	if (unlikely(tsk->flags & PF_KTHREAD) || return_to_kernel) {
 		/*
 		 * Make sure we don't leak any kernel data to child's regs
 		 * if kernel thread becomes a userspace thread in the future
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index b2dd3719..60e6e80 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -52,12 +52,12 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index eddd5fd..be41191 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -109,7 +109,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-	unsigned long kthread_arg, struct task_struct *p)
+	unsigned long kthread_arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
@@ -123,7 +123,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = read_c0_status() & ~(ST0_CU2|ST0_CU1);
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c
index 3707da5..cbf0fbe 100644
--- a/arch/mn10300/kernel/process.c
+++ b/arch/mn10300/kernel/process.c
@@ -142,7 +142,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long c_usp, unsigned long ustk_size,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *c_regs;
@@ -163,7 +163,7 @@ int copy_thread(unsigned long clone_flags,
 	p->thread.wchan	= p->thread.pc;
 	p->thread.usp	= c_usp;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 		c_regs->a0 = c_usp; /* function */
 		c_regs->d0 = ustk_size; /* argument */
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index 2f8c74f..e1fe2c0 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -97,7 +97,8 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags,
-		unsigned long usp, unsigned long arg, struct task_struct *p)
+		unsigned long usp, unsigned long arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct pt_regs *regs;
@@ -105,7 +106,7 @@ int copy_thread(unsigned long clone_flags,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 7095dfe..a54fc66 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -143,7 +143,7 @@ extern asmlinkage void ret_from_fork(void);
 
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	    unsigned long arg, struct task_struct *p)
+	    unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *userregs;
 	struct pt_regs *kregs;
@@ -164,7 +164,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index 809905a..7ed7eb63 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -186,7 +186,8 @@ int dump_task_fpu (struct task_struct *tsk, elf_fpregset_t *r)
  */
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	    unsigned long kthread_arg, struct task_struct *p)
+	    unsigned long kthread_arg, struct task_struct *p,
+	    int return_to_kernel)
 {
 	struct pt_regs *cregs = &(p->thread.regs);
 	void *stack = task_stack_page(p);
@@ -197,7 +198,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	extern void * const ret_from_kernel_thread;
 	extern void * const child_return;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(cregs, 0, sizeof(struct pt_regs));
 		if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 3c5736e..208929f 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1260,7 +1260,8 @@ static void setup_ksp_vsid(struct task_struct *p, unsigned long sp)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long kthread_arg, struct task_struct *p)
+		unsigned long kthread_arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs, *kregs;
 	extern void ret_from_fork(void);
@@ -1271,7 +1272,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/* Copy registers */
 	sp -= sizeof(struct pt_regs);
 	childregs = (struct pt_regs *) sp;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		struct thread_info *ti = (void *)task_stack_page(p);
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 2bba7df..8e8e9ef 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -118,7 +118,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti;
 	struct fake_frame
@@ -149,7 +149,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long) frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/score/kernel/process.c b/arch/score/kernel/process.c
index a1519ad3..937115c 100644
--- a/arch/score/kernel/process.c
+++ b/arch/score/kernel/process.c
@@ -69,14 +69,14 @@ void flush_thread(void) {}
  * set up the kernel stack and exception frames for a new process
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct pt_regs *regs = current_pt_regs();
 
 	p->thread.reg0 = (unsigned long) childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.reg12 = usp;
 		p->thread.reg13 = arg;
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 2885fc9..c8fdba3 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -124,7 +124,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs;
@@ -145,7 +145,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c
index e2062e6..31e116a 100644
--- a/arch/sh/kernel/process_64.c
+++ b/arch/sh/kernel/process_64.c
@@ -372,7 +372,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
@@ -390,7 +390,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childregs = (struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1;
 	p->thread.sp = (unsigned long) childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->regs[2] = (unsigned long)arg;
 		childregs->regs[3] = (unsigned long)usp;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index c5113c7..327bef8 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -306,7 +306,7 @@ extern void ret_from_fork(void);
 extern void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
@@ -342,7 +342,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 46a5964..0dcd4ba 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -617,7 +617,7 @@ asmlinkage long sparc_do_fork(unsigned long clone_flags,
  * Child  -->  %o0 == parents pid, %o1 == 1
  */
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *t = task_thread_info(p);
 	struct pt_regs *regs = current_pt_regs();
@@ -636,7 +636,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d3..6b24d49 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -100,7 +100,7 @@ void arch_release_thread_info(struct thread_info *info)
 static void save_arch_state(struct thread_struct *t);
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	unsigned long ksp;
@@ -126,7 +126,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	/* Record the pid of the task that created this one. */
 	p->thread.creator_pid = current->pid;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		memset(&callee_regs[2], 0,
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 48af59a..40ac5bf 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -155,7 +155,8 @@ void fork_handler(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct * p)
+		unsigned long arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	void (*handler)(void);
 	int kthread = current->flags & PF_KTHREAD;
@@ -163,7 +164,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 
 	p->thread = (struct thread_struct) INIT_THREAD;
 
-	if (!kthread) {
+	if (!kthread && !return_to_kernel) {
 	  	memcpy(&p->thread.regs.regs, current_pt_regs(),
 		       sizeof(p->thread.regs.regs));
 		PT_REGS_SET_SYSCALL_RETURN(&p->thread.regs, 0);
@@ -182,15 +183,16 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 
 	new_thread(task_stack_page(p), &p->thread.switch_buf, handler);
 
-	if (!kthread) {
-		clear_flushed_tls(p);
+	if (kthread || return_to_kernel)
+		return ret;
 
-		/*
-		 * Set a new TLS for the child thread?
-		 */
-		if (clone_flags & CLONE_SETTLS)
-			ret = arch_copy_tls(p);
-	}
+	clear_flushed_tls(p);
+
+	/*
+	 * Set a new TLS for the child thread?
+	 */
+	if (clone_flags & CLONE_SETTLS)
+		ret = arch_copy_tls(p);
 
 	return ret;
 }
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index b008e99..159961d 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -229,14 +229,14 @@ asmlinkage void ret_from_kernel_thread(void) __asm__("ret_from_kernel_thread");
 
 int
 copy_thread(unsigned long clone_flags, unsigned long stack_start,
-	    unsigned long stk_sz, struct task_struct *p)
+	    unsigned long stk_sz, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *thread = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	memset(&thread->cpu_context, 0, sizeof(struct cpu_context_save));
 	thread->cpu_context.sp = (unsigned long)childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		thread->cpu_context.pc = (unsigned long)ret_from_kernel_thread;
 		thread->cpu_context.r4 = stack_start;
 		thread->cpu_context.r5 = stk_sz;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 9f95091..2b1862e 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -130,7 +130,8 @@ void release_thread(struct task_struct *dead_task)
 }
 
 int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p, unsigned long tls)
+	unsigned long arg, struct task_struct *p, unsigned long tls,
+	int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -140,7 +141,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp0 = (unsigned long) (childregs+1);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.ip = (unsigned long) ret_from_kernel_thread;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b9d99e0..de05bc0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -153,7 +153,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 }
 
 int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p, unsigned long tls)
+		unsigned long arg, struct task_struct *p, unsigned long tls,
+		int return_to_kernel)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -173,7 +174,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	savesegment(ds, p->thread.ds);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->sp = (unsigned long)childregs;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 1c85323..7765634 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -189,7 +189,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  */
 
 int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
-		unsigned long thread_fn_arg, struct task_struct *p)
+		unsigned long thread_fn_arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
@@ -203,7 +204,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 
 	p->thread.sp = (unsigned long)childregs;
 
-	if (!(p->flags & PF_KTHREAD)) {
+	if (!(p->flags & PF_KTHREAD) && !return_to_kernel) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..56401e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2612,18 +2612,18 @@ extern void mm_release(struct task_struct *, struct mm_struct *);
 
 #ifdef CONFIG_HAVE_COPY_THREAD_TLS
 extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
-			struct task_struct *, unsigned long);
+			struct task_struct *, unsigned long, int);
 #else
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
-			struct task_struct *);
+			struct task_struct *, int);
 
 /* Architectures that haven't opted into copy_thread_tls get the tls argument
  * via pt_regs, so ignore the tls argument passed via C. */
 static inline int copy_thread_tls(
 		unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, int return_to_kernel)
 {
-	return copy_thread(clone_flags, sp, arg, p);
+	return copy_thread(clone_flags, sp, arg, p, return_to_kernel);
 }
 #endif
 extern void flush_thread(void);
@@ -2644,7 +2644,8 @@ extern int do_execveat(int, struct filename *,
 		       const char __user * const __user *,
 		       const char __user * const __user *,
 		       int);
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
+		     int __user *, unsigned long, int);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c7..643a09b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1245,7 +1245,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls)
+					unsigned long tls,
+					int return_to_kernel)
 {
 	int retval;
 	struct task_struct *p;
@@ -1451,7 +1452,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
+	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls,
+				 return_to_kernel);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1673,7 +1675,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1693,7 +1695,8 @@ long _do_fork(unsigned long clone_flags,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr,
-	      unsigned long tls)
+	      unsigned long tls,
+	      int return_to_kernel)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1718,7 +1721,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls);
+			 child_tidptr, NULL, trace, tls, return_to_kernel);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1769,7 +1772,7 @@ long do_fork(unsigned long clone_flags,
 	      int __user *child_tidptr)
 {
 	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+			parent_tidptr, child_tidptr, 0, 0);
 }
 #endif
 
@@ -1779,14 +1782,14 @@ long do_fork(unsigned long clone_flags,
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
 	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+		(unsigned long)arg, NULL, NULL, 0, 0);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, 0);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1798,7 +1801,7 @@ SYSCALL_DEFINE0(fork)
 SYSCALL_DEFINE0(vfork)
 {
 	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+			0, NULL, NULL, 0, 0);
 }
 #endif
 
@@ -1826,7 +1829,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
+			tls, 0);
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 1/3] Make _do_fork support return to caller's code
@ 2016-03-18 12:48     ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Eric W. Biederman, Mateusz Guzik, Zhao Lei

Currently, _do_fork() have following action:
1: Called by user task
   New task return to user space code directly, bypass remain
   kernel space code.
2: Called by kthread
   New task run into callback function in kernel space specified
   by _do_fork() caller.

In somecase, we need to fork a new task in user process's context,
it is used for next patch to fix core dump's security problem in
container.

This patch add a argument named return_to_kernel to _do_fork(),
to make new task always return to kernel_space with this argument.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 arch/alpha/kernel/process.c         |  4 ++--
 arch/arc/kernel/process.c           |  4 ++--
 arch/arm/kernel/process.c           |  4 ++--
 arch/arm64/kernel/process.c         |  5 +++--
 arch/avr32/kernel/process.c         |  5 +++--
 arch/blackfin/kernel/process.c      |  5 +++--
 arch/c6x/kernel/process.c           |  5 +++--
 arch/cris/arch-v10/kernel/process.c |  4 ++--
 arch/cris/arch-v32/kernel/process.c |  4 ++--
 arch/frv/kernel/process.c           |  4 ++--
 arch/h8300/kernel/process.c         |  4 ++--
 arch/hexagon/kernel/process.c       |  4 ++--
 arch/ia64/kernel/process.c          |  4 ++--
 arch/m32r/kernel/process.c          |  4 ++--
 arch/m68k/kernel/process.c          |  4 ++--
 arch/metag/kernel/process.c         |  5 +++--
 arch/microblaze/kernel/process.c    |  4 ++--
 arch/mips/kernel/process.c          |  4 ++--
 arch/mn10300/kernel/process.c       |  4 ++--
 arch/nios2/kernel/process.c         |  5 +++--
 arch/openrisc/kernel/process.c      |  4 ++--
 arch/parisc/kernel/process.c        |  5 +++--
 arch/powerpc/kernel/process.c       |  5 +++--
 arch/s390/kernel/process.c          |  4 ++--
 arch/score/kernel/process.c         |  4 ++--
 arch/sh/kernel/process_32.c         |  4 ++--
 arch/sh/kernel/process_64.c         |  4 ++--
 arch/sparc/kernel/process_32.c      |  4 ++--
 arch/sparc/kernel/process_64.c      |  4 ++--
 arch/tile/kernel/process.c          |  4 ++--
 arch/um/kernel/process.c            | 22 ++++++++++++----------
 arch/unicore32/kernel/process.c     |  4 ++--
 arch/x86/kernel/process_32.c        |  5 +++--
 arch/x86/kernel/process_64.c        |  5 +++--
 arch/xtensa/kernel/process.c        |  5 +++--
 include/linux/sched.h               | 11 ++++++-----
 kernel/fork.c                       | 24 ++++++++++++++----------
 37 files changed, 111 insertions(+), 93 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 84d1326..8c8b497 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -241,7 +241,7 @@ release_thread(struct task_struct *dead_task)
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
 	    unsigned long kthread_arg,
-	    struct task_struct *p)
+	    struct task_struct *p, int return_to_kernel)
 {
 	extern void ret_from_fork(void);
 	extern void ret_from_kernel_thread(void);
@@ -255,7 +255,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index a3f750e..252befc 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -88,7 +88,7 @@ asmlinkage void ret_from_fork(void);
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long kthread_arg,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *c_regs;        /* child's pt_regs */
 	unsigned long *childksp;       /* to unwind out of __switch_to() */
@@ -115,7 +115,7 @@ int copy_thread(unsigned long clone_flags,
 	childksp[0] = 0;			/* fp */
 	childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 
 		c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 4adfb46..f1acaf0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -222,7 +222,7 @@ asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
 int
 copy_thread(unsigned long clone_flags, unsigned long stack_start,
-	    unsigned long stk_sz, struct task_struct *p)
+	    unsigned long stk_sz, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *thread = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
@@ -239,7 +239,7 @@ copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	thread->cpu_domain = get_domain();
 #endif
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & PF_KTHREAD)) && !return_to_kernel) {
 		*childregs = *current_pt_regs();
 		childregs->ARM_r0 = 0;
 		if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 88d742b..62a38eb 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -245,13 +245,14 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
 int copy_thread(unsigned long clone_flags, unsigned long stack_start,
-		unsigned long stk_sz, struct task_struct *p)
+		unsigned long stk_sz, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
-	if (likely(!(p->flags & PF_KTHREAD))) {
+	if (likely(!(p->flags & PF_KTHREAD)) && !return_to_kernel) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
 
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 42a53e74..4ac993f 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -281,11 +281,12 @@ asmlinkage void syscall_return(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long arg,
-		struct task_struct *p)
+		struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.cpu_context.r0 = arg;
 		p->thread.cpu_context.r1 = usp; /* fn */
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 4aa5545..23d8655 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -112,14 +112,15 @@ asmlinkage int bfin_clone(unsigned long clone_flags, unsigned long newsp)
 int
 copy_thread(unsigned long clone_flags,
 	    unsigned long usp, unsigned long topstk,
-	    struct task_struct *p)
+	    struct task_struct *p,
+	    int return_to_kernel)
 {
 	struct pt_regs *childregs;
 	unsigned long *v;
 
 	childregs = (struct pt_regs *) (task_stack_page(p) + THREAD_SIZE) - 1;
 	v = ((unsigned long *)childregs) - 2;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		v[0] = usp;
 		v[1] = topstk;
diff --git a/arch/c6x/kernel/process.c b/arch/c6x/kernel/process.c
index 3ae9f5a..31fa2ad 100644
--- a/arch/c6x/kernel/process.c
+++ b/arch/c6x/kernel/process.c
@@ -112,13 +112,14 @@ void start_thread(struct pt_regs *regs, unsigned int pc, unsigned long usp)
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
 		unsigned long ustk_size,
-		struct task_struct *p)
+		struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
 	childregs = task_pt_regs(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* case of  __kernel_thread: we return to supervisor space */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->sp = (unsigned long)(childregs + 1);
diff --git a/arch/cris/arch-v10/kernel/process.c b/arch/cris/arch-v10/kernel/process.c
index 02b7834..d03a6bf 100644
--- a/arch/cris/arch-v10/kernel/process.c
+++ b/arch/cris/arch-v10/kernel/process.c
@@ -95,7 +95,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct switch_stack *swstack = ((struct switch_stack *)childregs) - 1;
@@ -104,7 +104,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	 * remember that the task_struct doubles as the kernel stack for the task
 	 */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(swstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		swstack->r1 = usp;
diff --git a/arch/cris/arch-v32/kernel/process.c b/arch/cris/arch-v32/kernel/process.c
index c7ce784..aa6b71f 100644
--- a/arch/cris/arch-v32/kernel/process.c
+++ b/arch/cris/arch-v32/kernel/process.c
@@ -103,7 +103,7 @@ extern asmlinkage void ret_from_kernel_thread(void);
 
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	unsigned long arg, struct task_struct *p)
+	unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct switch_stack *swstack = ((struct switch_stack *) childregs) - 1;
@@ -113,7 +113,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	 * fix it up. Note: the task_struct doubles as the kernel stack for the
 	 * task.
 	 */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(swstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 		swstack->r1 = usp;
diff --git a/arch/frv/kernel/process.c b/arch/frv/kernel/process.c
index 5d40aeb77..5f7d8a8 100644
--- a/arch/frv/kernel/process.c
+++ b/arch/frv/kernel/process.c
@@ -127,7 +127,7 @@ inline unsigned long user_stack(const struct pt_regs *regs)
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long arg,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
@@ -144,7 +144,7 @@ int copy_thread(unsigned long clone_flags,
 	p->thread.lr	 = 0;
 	p->thread.frame0 = childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		childregs->gr9 = usp; /* function */
 		childregs->gr8 = arg;
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index dee4125..c6613bd 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -104,13 +104,13 @@ void flush_thread(void)
 
 int copy_thread(unsigned long clone_flags,
 		unsigned long usp, unsigned long topstk,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
 	childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->retpc = (unsigned long) ret_from_kernel_thread;
 		childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index a9ebd47..920604b 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -69,7 +69,7 @@ unsigned long thread_saved_pc(struct task_struct *tsk)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct hexagon_switch_stack *ss;
@@ -91,7 +91,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 						    sizeof(*ss));
 	ss->lr = (unsigned long)ret_from_fork;
 	p->thread.switch_sp = ss;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		/* r24 <- fn, r25 <- arg */
 		ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index b515149..a5ff431 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -333,7 +333,7 @@ ia64_load_extra (struct task_struct *task)
 int
 copy_thread(unsigned long clone_flags,
 	     unsigned long user_stack_base, unsigned long user_stack_size,
-	     struct task_struct *p)
+	     struct task_struct *p, int return_to_kernel)
 {
 	extern char ia64_ret_from_clone;
 	struct switch_stack *child_stack, *stack;
@@ -375,7 +375,7 @@ copy_thread(unsigned long clone_flags,
 
 	ia64_drop_fpu(p);	/* don't pick up stale state from a CPU's fph */
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		if (unlikely(!user_stack_base)) {
 			/* fork_idle() called us */
 			return 0;
diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c
index e69221d..aec7068 100644
--- a/arch/m32r/kernel/process.c
+++ b/arch/m32r/kernel/process.c
@@ -129,13 +129,13 @@ int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpu)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long spu,
-	unsigned long arg, struct task_struct *tsk)
+	unsigned long arg, struct task_struct *tsk, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(tsk);
 	extern void ret_from_fork(void);
 	extern void ret_from_kernel_thread(void);
 
-	if (unlikely(tsk->flags & PF_KTHREAD)) {
+	if (unlikely(tsk->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->psw = M32R_PSW_BIE;
 		childregs->r1 = spu;	/* fn */
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index c55ff71..2c3e3d3 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -130,7 +130,7 @@ asmlinkage int m68k_clone(struct pt_regs *regs)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		 unsigned long arg, struct task_struct *p)
+		 unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct fork_frame {
 		struct switch_stack sw;
@@ -148,7 +148,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	 */
 	p->thread.fs = get_fs().seg;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(frame, 0, sizeof(struct fork_frame));
 		frame->regs.sr = PS_S;
diff --git a/arch/metag/kernel/process.c b/arch/metag/kernel/process.c
index 7f54618..2e78cd4 100644
--- a/arch/metag/kernel/process.c
+++ b/arch/metag/kernel/process.c
@@ -178,7 +178,8 @@ void show_regs(struct pt_regs *regs)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long kthread_arg, struct task_struct *tsk)
+		unsigned long kthread_arg, struct task_struct *tsk,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(tsk);
 	void *kernel_context = ((void *) childregs +
@@ -195,7 +196,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 						     ret_from_fork,
 						     0, 0);
 
-	if (unlikely(tsk->flags & PF_KTHREAD)) {
+	if (unlikely(tsk->flags & PF_KTHREAD) || return_to_kernel) {
 		/*
 		 * Make sure we don't leak any kernel data to child's regs
 		 * if kernel thread becomes a userspace thread in the future
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index b2dd3719..60e6e80 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -52,12 +52,12 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct thread_info *ti = task_thread_info(p);
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* if we're creating a new kernel thread then just zeroing all
 		 * the registers. That's OK for a brand new thread.*/
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index eddd5fd..be41191 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -109,7 +109,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-	unsigned long kthread_arg, struct task_struct *p)
+	unsigned long kthread_arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
@@ -123,7 +123,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/*  Put the stack after the struct pt_regs.  */
 	childksp = (unsigned long) childregs;
 	p->thread.cp0_status = read_c0_status() & ~(ST0_CU2|ST0_CU1);
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		unsigned long status = p->thread.cp0_status;
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c
index 3707da5..cbf0fbe 100644
--- a/arch/mn10300/kernel/process.c
+++ b/arch/mn10300/kernel/process.c
@@ -142,7 +142,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  */
 int copy_thread(unsigned long clone_flags,
 		unsigned long c_usp, unsigned long ustk_size,
-		struct task_struct *p)
+		struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *c_regs;
@@ -163,7 +163,7 @@ int copy_thread(unsigned long clone_flags,
 	p->thread.wchan	= p->thread.pc;
 	p->thread.usp	= c_usp;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(c_regs, 0, sizeof(struct pt_regs));
 		c_regs->a0 = c_usp; /* function */
 		c_regs->d0 = ustk_size; /* argument */
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index 2f8c74f..e1fe2c0 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -97,7 +97,8 @@ void flush_thread(void)
 }
 
 int copy_thread(unsigned long clone_flags,
-		unsigned long usp, unsigned long arg, struct task_struct *p)
+		unsigned long usp, unsigned long arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct pt_regs *regs;
@@ -105,7 +106,7 @@ int copy_thread(unsigned long clone_flags,
 	struct switch_stack *childstack =
 		((struct switch_stack *)childregs) - 1;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childstack, 0,
 			sizeof(struct switch_stack) + sizeof(struct pt_regs));
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 7095dfe..a54fc66 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -143,7 +143,7 @@ extern asmlinkage void ret_from_fork(void);
 
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	    unsigned long arg, struct task_struct *p)
+	    unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *userregs;
 	struct pt_regs *kregs;
@@ -164,7 +164,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	sp -= sizeof(struct pt_regs);
 	kregs = (struct pt_regs *)sp;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(kregs, 0, sizeof(struct pt_regs));
 		kregs->gpr[20] = usp; /* fn, kernel thread */
 		kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index 809905a..7ed7eb63 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -186,7 +186,8 @@ int dump_task_fpu (struct task_struct *tsk, elf_fpregset_t *r)
  */
 int
 copy_thread(unsigned long clone_flags, unsigned long usp,
-	    unsigned long kthread_arg, struct task_struct *p)
+	    unsigned long kthread_arg, struct task_struct *p,
+	    int return_to_kernel)
 {
 	struct pt_regs *cregs = &(p->thread.regs);
 	void *stack = task_stack_page(p);
@@ -197,7 +198,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 	extern void * const ret_from_kernel_thread;
 	extern void * const child_return;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(cregs, 0, sizeof(struct pt_regs));
 		if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 3c5736e..208929f 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1260,7 +1260,8 @@ static void setup_ksp_vsid(struct task_struct *p, unsigned long sp)
  * Copy architecture-specific thread state
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long kthread_arg, struct task_struct *p)
+		unsigned long kthread_arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs, *kregs;
 	extern void ret_from_fork(void);
@@ -1271,7 +1272,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	/* Copy registers */
 	sp -= sizeof(struct pt_regs);
 	childregs = (struct pt_regs *) sp;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		struct thread_info *ti = (void *)task_stack_page(p);
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 2bba7df..8e8e9ef 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -118,7 +118,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti;
 	struct fake_frame
@@ -149,7 +149,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
 	frame->sf.gprs[9] = (unsigned long) frame;
 
 	/* Store access registers to kernel stack of new process. */
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(&frame->childregs, 0, sizeof(struct pt_regs));
 		frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/score/kernel/process.c b/arch/score/kernel/process.c
index a1519ad3..937115c 100644
--- a/arch/score/kernel/process.c
+++ b/arch/score/kernel/process.c
@@ -69,14 +69,14 @@ void flush_thread(void) {}
  * set up the kernel stack and exception frames for a new process
  */
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct pt_regs *regs = current_pt_regs();
 
 	p->thread.reg0 = (unsigned long) childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.reg12 = usp;
 		p->thread.reg13 = arg;
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 2885fc9..c8fdba3 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -124,7 +124,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs;
@@ -145,7 +145,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.pc = (unsigned long) ret_from_kernel_thread;
 		childregs->regs[4] = arg;
diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c
index e2062e6..31e116a 100644
--- a/arch/sh/kernel/process_64.c
+++ b/arch/sh/kernel/process_64.c
@@ -372,7 +372,7 @@ asmlinkage void ret_from_fork(void);
 asmlinkage void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long usp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs;
 
@@ -390,7 +390,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childregs = (struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1;
 	p->thread.sp = (unsigned long) childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->regs[2] = (unsigned long)arg;
 		childregs->regs[3] = (unsigned long)usp;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index c5113c7..327bef8 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -306,7 +306,7 @@ extern void ret_from_fork(void);
 extern void ret_from_kernel_thread(void);
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
@@ -342,7 +342,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	ti->ksp = (unsigned long) new_stack;
 	p->thread.kregs = childregs;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		extern int nwindows;
 		unsigned long psr;
 		memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 46a5964..0dcd4ba 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -617,7 +617,7 @@ asmlinkage long sparc_do_fork(unsigned long clone_flags,
  * Child  -->  %o0 == parents pid, %o1 == 1
  */
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *t = task_thread_info(p);
 	struct pt_regs *regs = current_pt_regs();
@@ -636,7 +636,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 				       sizeof(struct sparc_stackf));
 	t->fpsaved[0] = 0;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		memset(child_trap_frame, 0, child_stack_sz);
 		__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] = 
 			(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d3..6b24d49 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -100,7 +100,7 @@ void arch_release_thread_info(struct thread_info *info)
 static void save_arch_state(struct thread_struct *t);
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p)
+		unsigned long arg, struct task_struct *p, int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	unsigned long ksp;
@@ -126,7 +126,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	/* Record the pid of the task that created this one. */
 	p->thread.creator_pid = current->pid;
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		memset(&callee_regs[2], 0,
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 48af59a..40ac5bf 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -155,7 +155,8 @@ void fork_handler(void)
 }
 
 int copy_thread(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct * p)
+		unsigned long arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	void (*handler)(void);
 	int kthread = current->flags & PF_KTHREAD;
@@ -163,7 +164,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 
 	p->thread = (struct thread_struct) INIT_THREAD;
 
-	if (!kthread) {
+	if (!kthread && !return_to_kernel) {
 	  	memcpy(&p->thread.regs.regs, current_pt_regs(),
 		       sizeof(p->thread.regs.regs));
 		PT_REGS_SET_SYSCALL_RETURN(&p->thread.regs, 0);
@@ -182,15 +183,16 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 
 	new_thread(task_stack_page(p), &p->thread.switch_buf, handler);
 
-	if (!kthread) {
-		clear_flushed_tls(p);
+	if (kthread || return_to_kernel)
+		return ret;
 
-		/*
-		 * Set a new TLS for the child thread?
-		 */
-		if (clone_flags & CLONE_SETTLS)
-			ret = arch_copy_tls(p);
-	}
+	clear_flushed_tls(p);
+
+	/*
+	 * Set a new TLS for the child thread?
+	 */
+	if (clone_flags & CLONE_SETTLS)
+		ret = arch_copy_tls(p);
 
 	return ret;
 }
diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index b008e99..159961d 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -229,14 +229,14 @@ asmlinkage void ret_from_kernel_thread(void) __asm__("ret_from_kernel_thread");
 
 int
 copy_thread(unsigned long clone_flags, unsigned long stack_start,
-	    unsigned long stk_sz, struct task_struct *p)
+	    unsigned long stk_sz, struct task_struct *p, int return_to_kernel)
 {
 	struct thread_info *thread = task_thread_info(p);
 	struct pt_regs *childregs = task_pt_regs(p);
 
 	memset(&thread->cpu_context, 0, sizeof(struct cpu_context_save));
 	thread->cpu_context.sp = (unsigned long)childregs;
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		thread->cpu_context.pc = (unsigned long)ret_from_kernel_thread;
 		thread->cpu_context.r4 = stack_start;
 		thread->cpu_context.r5 = stk_sz;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 9f95091..2b1862e 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -130,7 +130,8 @@ void release_thread(struct task_struct *dead_task)
 }
 
 int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-	unsigned long arg, struct task_struct *p, unsigned long tls)
+	unsigned long arg, struct task_struct *p, unsigned long tls,
+	int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 	struct task_struct *tsk;
@@ -140,7 +141,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp0 = (unsigned long) (childregs+1);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		p->thread.ip = (unsigned long) ret_from_kernel_thread;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b9d99e0..de05bc0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -153,7 +153,8 @@ static inline u32 read_32bit_tls(struct task_struct *t, int tls)
 }
 
 int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
-		unsigned long arg, struct task_struct *p, unsigned long tls)
+		unsigned long arg, struct task_struct *p, unsigned long tls,
+		int return_to_kernel)
 {
 	int err;
 	struct pt_regs *childregs;
@@ -173,7 +174,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	savesegment(ds, p->thread.ds);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
-	if (unlikely(p->flags & PF_KTHREAD)) {
+	if (unlikely(p->flags & PF_KTHREAD) || return_to_kernel) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
 		childregs->sp = (unsigned long)childregs;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 1c85323..7765634 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -189,7 +189,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
  */
 
 int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
-		unsigned long thread_fn_arg, struct task_struct *p)
+		unsigned long thread_fn_arg, struct task_struct *p,
+		int return_to_kernel)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
 
@@ -203,7 +204,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
 
 	p->thread.sp = (unsigned long)childregs;
 
-	if (!(p->flags & PF_KTHREAD)) {
+	if (!(p->flags & PF_KTHREAD) && !return_to_kernel) {
 		struct pt_regs *regs = current_pt_regs();
 		unsigned long usp = usp_thread_fn ?
 			usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..56401e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2612,18 +2612,18 @@ extern void mm_release(struct task_struct *, struct mm_struct *);
 
 #ifdef CONFIG_HAVE_COPY_THREAD_TLS
 extern int copy_thread_tls(unsigned long, unsigned long, unsigned long,
-			struct task_struct *, unsigned long);
+			struct task_struct *, unsigned long, int);
 #else
 extern int copy_thread(unsigned long, unsigned long, unsigned long,
-			struct task_struct *);
+			struct task_struct *, int);
 
 /* Architectures that haven't opted into copy_thread_tls get the tls argument
  * via pt_regs, so ignore the tls argument passed via C. */
 static inline int copy_thread_tls(
 		unsigned long clone_flags, unsigned long sp, unsigned long arg,
-		struct task_struct *p, unsigned long tls)
+		struct task_struct *p, unsigned long tls, int return_to_kernel)
 {
-	return copy_thread(clone_flags, sp, arg, p);
+	return copy_thread(clone_flags, sp, arg, p, return_to_kernel);
 }
 #endif
 extern void flush_thread(void);
@@ -2644,7 +2644,8 @@ extern int do_execveat(int, struct filename *,
 		       const char __user * const __user *,
 		       const char __user * const __user *,
 		       int);
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
+		     int __user *, unsigned long, int);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c7..643a09b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1245,7 +1245,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls)
+					unsigned long tls,
+					int return_to_kernel)
 {
 	int retval;
 	struct task_struct *p;
@@ -1451,7 +1452,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
-	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
+	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls,
+				 return_to_kernel);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -1673,7 +1675,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1693,7 +1695,8 @@ long _do_fork(unsigned long clone_flags,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr,
-	      unsigned long tls)
+	      unsigned long tls,
+	      int return_to_kernel)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1718,7 +1721,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls);
+			 child_tidptr, NULL, trace, tls, return_to_kernel);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1769,7 +1772,7 @@ long do_fork(unsigned long clone_flags,
 	      int __user *child_tidptr)
 {
 	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+			parent_tidptr, child_tidptr, 0, 0);
 }
 #endif
 
@@ -1779,14 +1782,14 @@ long do_fork(unsigned long clone_flags,
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
 	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+		(unsigned long)arg, NULL, NULL, 0, 0);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0, 0);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -1798,7 +1801,7 @@ SYSCALL_DEFINE0(fork)
 SYSCALL_DEFINE0(vfork)
 {
 	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+			0, NULL, NULL, 0, 0);
 }
 #endif
 
@@ -1826,7 +1829,8 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
+			tls, 0);
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/3] Run dump pipe in container's namespace
  2016-03-18 12:48 ` Zhao Lei
@ 2016-03-18 12:48     ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

In current system, when we set core_pattern to a pipe, both pipe program
and program's output are in host's filesystem.

For example, when we set following core_pattern:
 # echo "|/my_dump_pipe %s %c %p %u %g %t e" >/proc/sys/kernel/core_pattern
and trigger a segment fault in a container, my_dump_pipe is searched from
host's filesystem, and it will write coredump into host's filesystem too.

In a privileged container, user can destroy host system by following
command:
 # # In a container
 # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
 # make_dump

Actually, all operation in a container should not change host's
environment, the container should use core_pattern as its private setting.
In detail, in core dump action:
1: Search pipe program in container's fs namespace.
2: Run pipe program in container's fs namespace to write coredump to it.

This patch fixed above problem running pipe program in user process's
context instead of kthread.

Test:
 # ################
 # # In host's system
 # ################
 #
 # ulimit -c 1024000
 # echo "|/dump_pipe" >/proc/sys/kernel/core_pattern
 # cat /dump_pipe
 #!/bin/sh
 cat >/tmp/host_dump_$1_$2_$3_$4_$5_$6
 # rm -f /tmp/*dump*
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l /tmp/*dump*
 -rw-r--r-- 1 root root 331776 Mar 16 16:57 /tmp/host_dump______
 #
 # lxc-start -n vm01
 #
 # ################
 # # In guest's system:
 # ################
 #
 # cat /proc/sys/kernel/core_pattern
 |/dump_pipe
 # cat /dump_pipe
 #!/bin/sh
 cat >/tmp/guest_dump_$1_$2_$3_$4_$5_$6
 # rm -f /tmp/*dump*
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l /tmp/*dump*
 -rw-r--r--    1 root     root       331776 Mar 16 09:02 /tmp/guest_dump______
 #

Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 fs/coredump.c         | 76 +++++++++++++++++++++++++++++++--------------------
 include/linux/sched.h |  1 +
 kernel/fork.c         |  6 ++++
 3 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 9ea87e9..863c23a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -496,33 +496,50 @@ static void wait_for_dump_helpers(struct file *file)
 	pipe_unlock(pipe);
 }
 
-/*
- * umh_pipe_setup
- * helper function to customize the process used
- * to collect the core in userspace.  Specifically
- * it sets up a pipe and installs it as fd 0 (stdin)
- * for the process.  Returns 0 on success, or
- * PTR_ERR on failure.
- * Note that it also sets the core limit to 1.  This
- * is a special value that we use to trap recursive
- * core dumps
- */
-static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+struct pipeprg_data {
+	char **argv;
+	struct coredump_params *cp;
+};
+
+static int fork_callback(void *data)
 {
+	struct pipeprg_data *ppd = (struct pipeprg_data *)data;
 	struct file *files[2];
-	struct coredump_params *cp = (struct coredump_params *)info->data;
-	int err = create_pipe_files(files, 0);
-	if (err)
-		return err;
+	int ret;
+
+	/*
+	 * Sets up a pipe and installs it as fd 0 (stdin)
+	 * for the process.
+	 */
+	ret = create_pipe_files(files, 0);
+	if (ret)
+		do_exit(0);
 
-	cp->file = files[1];
+	ppd->cp->file = files[1];
 
-	err = replace_fd(0, files[0], 0);
+	ret = replace_fd(0, files[0], 0);
 	fput(files[0]);
-	/* and disallow core files too */
+	if (ret < 0)
+		do_exit(0);
+
+	/*
+	 * Sets the core limit to 1.  This
+	 * is a special value that we use to trap recursive
+	 * core dumps
+	 */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
 
-	return err;
+	set_fs(KERNEL_DS);
+	ret = do_execve(getname_kernel(ppd->argv[0]),
+			(const char __user *const __user *)ppd->argv,
+			(const char __user *const __user *)NULL);
+	if (ret) {
+		printk(KERN_WARNING "execute pipe program failed: %s ret=%d\n",
+		       ppd->argv[0], ret);
+		do_exit(0);
+	}
+
+	return ret;
 }
 
 void do_coredump(const siginfo_t *siginfo)
@@ -586,7 +603,8 @@ void do_coredump(const siginfo_t *siginfo)
 	if (ispipe) {
 		int dump_count;
 		char **helper_argv;
-		struct subprocess_info *sub_info;
+		struct pipeprg_data ppd;
+		pid_t pid;
 
 		if (ispipe < 0) {
 			printk(KERN_WARNING "format_corename failed\n");
@@ -633,19 +651,17 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
-		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_pipe_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
+		ppd.argv = helper_argv;
+		ppd.cp = &cprm;
 
+		pid = user_thread(fork_callback, &ppd,
+				  CLONE_VFORK | CLONE_UNTRACED);
 		argv_free(helper_argv);
-		if (retval) {
+		if (pid < 0) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
 			       cn.corename);
-			goto close_fail;
+			retval = pid;
+			goto fail_dropcount;
 		}
 	} else {
 		struct inode *inode;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56401e4..a1893f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2649,6 +2649,7 @@ extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags);
 
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 static inline void set_task_comm(struct task_struct *tsk, const char *from)
diff --git a/kernel/fork.c b/kernel/fork.c
index 643a09b..71b3339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1785,6 +1785,12 @@ pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 		(unsigned long)arg, NULL, NULL, 0, 0);
 }
 
+pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags)
+{
+	return _do_fork(flags, (unsigned long)fn,
+		(unsigned long)arg, NULL, NULL, 0, 1);
+}
+
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/3] Run dump pipe in container's namespace
@ 2016-03-18 12:48     ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Eric W. Biederman, Mateusz Guzik, Zhao Lei

In current system, when we set core_pattern to a pipe, both pipe program
and program's output are in host's filesystem.

For example, when we set following core_pattern:
 # echo "|/my_dump_pipe %s %c %p %u %g %t e" >/proc/sys/kernel/core_pattern
and trigger a segment fault in a container, my_dump_pipe is searched from
host's filesystem, and it will write coredump into host's filesystem too.

In a privileged container, user can destroy host system by following
command:
 # # In a container
 # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
 # make_dump

Actually, all operation in a container should not change host's
environment, the container should use core_pattern as its private setting.
In detail, in core dump action:
1: Search pipe program in container's fs namespace.
2: Run pipe program in container's fs namespace to write coredump to it.

This patch fixed above problem running pipe program in user process's
context instead of kthread.

Test:
 # ################
 # # In host's system
 # ################
 #
 # ulimit -c 1024000
 # echo "|/dump_pipe" >/proc/sys/kernel/core_pattern
 # cat /dump_pipe
 #!/bin/sh
 cat >/tmp/host_dump_$1_$2_$3_$4_$5_$6
 # rm -f /tmp/*dump*
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l /tmp/*dump*
 -rw-r--r-- 1 root root 331776 Mar 16 16:57 /tmp/host_dump______
 #
 # lxc-start -n vm01
 #
 # ################
 # # In guest's system:
 # ################
 #
 # cat /proc/sys/kernel/core_pattern
 |/dump_pipe
 # cat /dump_pipe
 #!/bin/sh
 cat >/tmp/guest_dump_$1_$2_$3_$4_$5_$6
 # rm -f /tmp/*dump*
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l /tmp/*dump*
 -rw-r--r--    1 root     root       331776 Mar 16 09:02 /tmp/guest_dump______
 #

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/coredump.c         | 76 +++++++++++++++++++++++++++++++--------------------
 include/linux/sched.h |  1 +
 kernel/fork.c         |  6 ++++
 3 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 9ea87e9..863c23a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -496,33 +496,50 @@ static void wait_for_dump_helpers(struct file *file)
 	pipe_unlock(pipe);
 }
 
-/*
- * umh_pipe_setup
- * helper function to customize the process used
- * to collect the core in userspace.  Specifically
- * it sets up a pipe and installs it as fd 0 (stdin)
- * for the process.  Returns 0 on success, or
- * PTR_ERR on failure.
- * Note that it also sets the core limit to 1.  This
- * is a special value that we use to trap recursive
- * core dumps
- */
-static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+struct pipeprg_data {
+	char **argv;
+	struct coredump_params *cp;
+};
+
+static int fork_callback(void *data)
 {
+	struct pipeprg_data *ppd = (struct pipeprg_data *)data;
 	struct file *files[2];
-	struct coredump_params *cp = (struct coredump_params *)info->data;
-	int err = create_pipe_files(files, 0);
-	if (err)
-		return err;
+	int ret;
+
+	/*
+	 * Sets up a pipe and installs it as fd 0 (stdin)
+	 * for the process.
+	 */
+	ret = create_pipe_files(files, 0);
+	if (ret)
+		do_exit(0);
 
-	cp->file = files[1];
+	ppd->cp->file = files[1];
 
-	err = replace_fd(0, files[0], 0);
+	ret = replace_fd(0, files[0], 0);
 	fput(files[0]);
-	/* and disallow core files too */
+	if (ret < 0)
+		do_exit(0);
+
+	/*
+	 * Sets the core limit to 1.  This
+	 * is a special value that we use to trap recursive
+	 * core dumps
+	 */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
 
-	return err;
+	set_fs(KERNEL_DS);
+	ret = do_execve(getname_kernel(ppd->argv[0]),
+			(const char __user *const __user *)ppd->argv,
+			(const char __user *const __user *)NULL);
+	if (ret) {
+		printk(KERN_WARNING "execute pipe program failed: %s ret=%d\n",
+		       ppd->argv[0], ret);
+		do_exit(0);
+	}
+
+	return ret;
 }
 
 void do_coredump(const siginfo_t *siginfo)
@@ -586,7 +603,8 @@ void do_coredump(const siginfo_t *siginfo)
 	if (ispipe) {
 		int dump_count;
 		char **helper_argv;
-		struct subprocess_info *sub_info;
+		struct pipeprg_data ppd;
+		pid_t pid;
 
 		if (ispipe < 0) {
 			printk(KERN_WARNING "format_corename failed\n");
@@ -633,19 +651,17 @@ void do_coredump(const siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
-		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_pipe_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
+		ppd.argv = helper_argv;
+		ppd.cp = &cprm;
 
+		pid = user_thread(fork_callback, &ppd,
+				  CLONE_VFORK | CLONE_UNTRACED);
 		argv_free(helper_argv);
-		if (retval) {
+		if (pid < 0) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
 			       cn.corename);
-			goto close_fail;
+			retval = pid;
+			goto fail_dropcount;
 		}
 	} else {
 		struct inode *inode;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56401e4..a1893f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2649,6 +2649,7 @@ extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *,
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags);
 
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 static inline void set_task_comm(struct task_struct *tsk, const char *from)
diff --git a/kernel/fork.c b/kernel/fork.c
index 643a09b..71b3339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1785,6 +1785,12 @@ pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 		(unsigned long)arg, NULL, NULL, 0, 0);
 }
 
+pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags)
+{
+	return _do_fork(flags, (unsigned long)fn,
+		(unsigned long)arg, NULL, NULL, 0, 1);
+}
+
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-18 12:48 ` Zhao Lei
@ 2016-03-18 12:48     ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Currently, each container shared one copy of coredump setting
with the host system, if host system changed the setting, each
running containers will be affected.

Moreover, it is not easy to let each container keeping their own
coredump setting.

We can use some workaround as pipe program to make the second
requirement possible, but it is not simple, and both host and
container are limited to set to fixed pipe program.
In one word, for host running contailer, we can't change core_pattern
anymore.
To make the problem more hard, if a host running more than one
container product, each product will try to snatch the global
coredump setting to fit their own requirement.

For container based on namespace design, it is good to allow
each container keeping their own coredump setting.

It will bring us following benefit:
1: Each container can change their own coredump setting
   based on operation on /proc/sys/kernel/core_pattern
2: Coredump setting changed in host will not affect
   running containers.
3: Support both case of "putting coredump in guest" and
   "putting curedump in host".

Each namespace-based software(lxc, docker, ..) can use this function
to custom their dump setting.

And this function makes each continer working as separate system,
it fit for design goal of namespace.

Test(in lxc):
 # In the host
 # ----------------
 # echo host_core >/proc/sys/kernel/core_pattern
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ulimit -c 1024000
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:02 host_core.2175
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

 # In the container
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # echo container_core >/proc/sys/kernel/core_pattern
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rwxr-xr-x    1 root     root       759731 Feb  4 10:45 make_dump
 -rw-------    1 root     root       331776 Feb  4 10:45 container_core.16
 #

 # Return to host
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ls
 host_core.2175  make_dump  make_dump.c
 # rm -f host_core.2175
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:49 host_core.2351
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

Signed-off-by: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 fs/coredump.c                 |  3 +--
 include/linux/pid_namespace.h |  2 ++
 kernel/pid.c                  |  1 +
 kernel/pid_namespace.c        |  3 +++
 kernel/sysctl.c               | 22 ++++++++++++++++------
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 863c23a..e693877 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -46,7 +46,6 @@
 
 int core_uses_pid;
 unsigned int core_pipe_limit;
-char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 
 struct core_name {
@@ -183,7 +182,7 @@ put_exe_file:
 static int format_corename(struct core_name *cn, struct coredump_params *cprm)
 {
 	const struct cred *cred = current_cred();
-	const char *pat_ptr = core_pattern;
+	const char *pat_ptr = current->nsproxy->pid_ns_for_children->core_pattern;
 	int ispipe = (*pat_ptr == '|');
 	int pid_in_pattern = 0;
 	int err = 0;
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 918b117..a5af1e9 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/kref.h>
 #include <linux/ns_common.h>
+#include <linux/binfmts.h>
 
 struct pidmap {
        atomic_t nr_free;
@@ -45,6 +46,7 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+	char core_pattern[CORENAME_MAX_SIZE];
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index 4d73a83..c79c1d5 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+	.core_pattern = "core",
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a65ba13..16d6d21 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -123,6 +123,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
+		sizeof(ns->core_pattern));
+
 	return ns;
 
 out_free_map:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd..70f8af5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -100,7 +100,6 @@
 extern int suid_dumpable;
 #ifdef CONFIG_COREDUMP
 extern int core_uses_pid;
-extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
 extern int pid_max;
@@ -469,7 +468,7 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "core_pattern",
-		.data		= core_pattern,
+		.data		= NULL,
 		.maxlen		= CORENAME_MAX_SIZE,
 		.mode		= 0644,
 		.proc_handler	= proc_dostring_coredump,
@@ -2301,6 +2300,8 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 static void validate_coredump_safety(void)
 {
 #ifdef CONFIG_COREDUMP
+	char *core_pattern = current->nsproxy->pid_ns_for_children->core_pattern;
+
 	if (suid_dumpable == SUID_DUMP_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|') {
 		printk(KERN_WARNING "Unsafe core_pattern used with "\
@@ -2323,10 +2324,19 @@ static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 static int proc_dostring_coredump(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
-	if (!error)
-		validate_coredump_safety();
-	return error;
+	int ret;
+
+	if (write && *ppos && sysctl_writes_strict == SYSCTL_WRITES_WARN)
+		warn_sysctl_write(table);
+
+	ret = _proc_do_string(current->nsproxy->pid_ns_for_children->core_pattern,
+			      table->maxlen, write,
+			      (char __user *)buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	validate_coredump_safety();
+	return 0;
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-18 12:48     ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-18 12:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, Eric W. Biederman, Mateusz Guzik, Zhao Lei

Currently, each container shared one copy of coredump setting
with the host system, if host system changed the setting, each
running containers will be affected.

Moreover, it is not easy to let each container keeping their own
coredump setting.

We can use some workaround as pipe program to make the second
requirement possible, but it is not simple, and both host and
container are limited to set to fixed pipe program.
In one word, for host running contailer, we can't change core_pattern
anymore.
To make the problem more hard, if a host running more than one
container product, each product will try to snatch the global
coredump setting to fit their own requirement.

For container based on namespace design, it is good to allow
each container keeping their own coredump setting.

It will bring us following benefit:
1: Each container can change their own coredump setting
   based on operation on /proc/sys/kernel/core_pattern
2: Coredump setting changed in host will not affect
   running containers.
3: Support both case of "putting coredump in guest" and
   "putting curedump in host".

Each namespace-based software(lxc, docker, ..) can use this function
to custom their dump setting.

And this function makes each continer working as separate system,
it fit for design goal of namespace.

Test(in lxc):
 # In the host
 # ----------------
 # echo host_core >/proc/sys/kernel/core_pattern
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ulimit -c 1024000
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:02 host_core.2175
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

 # In the container
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # echo container_core >/proc/sys/kernel/core_pattern
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rwxr-xr-x    1 root     root       759731 Feb  4 10:45 make_dump
 -rw-------    1 root     root       331776 Feb  4 10:45 container_core.16
 #

 # Return to host
 # ----------------
 # cat /proc/sys/kernel/core_pattern
 host_core
 # ls
 host_core.2175  make_dump  make_dump.c
 # rm -f host_core.2175
 # ./make_dump
 Segmentation fault (core dumped)
 # ls -l
 -rw------- 1 root root 331776 Feb  4 18:49 host_core.2351
 -rwxr-xr-x 1 root root 759731 Feb  4 18:01 make_dump
 #

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
---
 fs/coredump.c                 |  3 +--
 include/linux/pid_namespace.h |  2 ++
 kernel/pid.c                  |  1 +
 kernel/pid_namespace.c        |  3 +++
 kernel/sysctl.c               | 22 ++++++++++++++++------
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 863c23a..e693877 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -46,7 +46,6 @@
 
 int core_uses_pid;
 unsigned int core_pipe_limit;
-char core_pattern[CORENAME_MAX_SIZE] = "core";
 static int core_name_size = CORENAME_MAX_SIZE;
 
 struct core_name {
@@ -183,7 +182,7 @@ put_exe_file:
 static int format_corename(struct core_name *cn, struct coredump_params *cprm)
 {
 	const struct cred *cred = current_cred();
-	const char *pat_ptr = core_pattern;
+	const char *pat_ptr = current->nsproxy->pid_ns_for_children->core_pattern;
 	int ispipe = (*pat_ptr == '|');
 	int pid_in_pattern = 0;
 	int err = 0;
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 918b117..a5af1e9 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/kref.h>
 #include <linux/ns_common.h>
+#include <linux/binfmts.h>
 
 struct pidmap {
        atomic_t nr_free;
@@ -45,6 +46,7 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+	char core_pattern[CORENAME_MAX_SIZE];
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index 4d73a83..c79c1d5 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+	.core_pattern = "core",
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a65ba13..16d6d21 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -123,6 +123,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	for (i = 1; i < PIDMAP_ENTRIES; i++)
 		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
 
+	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
+		sizeof(ns->core_pattern));
+
 	return ns;
 
 out_free_map:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd..70f8af5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -100,7 +100,6 @@
 extern int suid_dumpable;
 #ifdef CONFIG_COREDUMP
 extern int core_uses_pid;
-extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
 extern int pid_max;
@@ -469,7 +468,7 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "core_pattern",
-		.data		= core_pattern,
+		.data		= NULL,
 		.maxlen		= CORENAME_MAX_SIZE,
 		.mode		= 0644,
 		.proc_handler	= proc_dostring_coredump,
@@ -2301,6 +2300,8 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 static void validate_coredump_safety(void)
 {
 #ifdef CONFIG_COREDUMP
+	char *core_pattern = current->nsproxy->pid_ns_for_children->core_pattern;
+
 	if (suid_dumpable == SUID_DUMP_ROOT &&
 	    core_pattern[0] != '/' && core_pattern[0] != '|') {
 		printk(KERN_WARNING "Unsafe core_pattern used with "\
@@ -2323,10 +2324,19 @@ static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 static int proc_dostring_coredump(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	int error = proc_dostring(table, write, buffer, lenp, ppos);
-	if (!error)
-		validate_coredump_safety();
-	return error;
+	int ret;
+
+	if (write && *ppos && sysctl_writes_strict == SYSCTL_WRITES_WARN)
+		warn_sysctl_write(table);
+
+	ret = _proc_do_string(current->nsproxy->pid_ns_for_children->core_pattern,
+			      table->maxlen, write,
+			      (char __user *)buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	validate_coredump_safety();
+	return 0;
 }
 #endif
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/3] Run dump pipe in container's namespace
  2016-03-18 12:48     ` Zhao Lei
@ 2016-03-18 14:06         ` kbuild test robot
  -1 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-03-18 14:06 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	kbuild-all-JC7UmRfGjtg

[-- Attachment #1: Type: text/plain, Size: 3041 bytes --]

Hi Zhao,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.5 next-20160318]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Zhao-Lei/Make-core_pattern-support-namespace/20160318-205506
config: um-x86_64_defconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=um SUBARCH=x86_64

All errors (new ones prefixed by >>):

   kernel/time/Kconfig:155:warning: range is invalid
   In file included from arch/x86/um/asm/elf.h:9:0,
                    from include/linux/elf.h:4,
                    from arch/x86/um/shared/sysdep/kernel-offsets.h:3,
                    from arch/um/kernel/asm-offsets.c:1:
>> arch/um/include/shared/skas/skas.h:13:12: error: conflicting types for 'user_thread'
    extern int user_thread(unsigned long stack, int flags);
               ^
   In file included from arch/x86/um/shared/sysdep/kernel-offsets.h:2:0,
                    from arch/um/kernel/asm-offsets.c:1:
   include/linux/sched.h:2652:14: note: previous declaration of 'user_thread' was here
    extern pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags);
                 ^
   make[2]: *** [arch/um/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/user_thread +13 arch/um/include/shared/skas/skas.h

^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16   7  #define __SKAS_H
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16   8  
37185b3324 arch/um/include/shared/skas/skas.h Al Viro        2012-10-08   9  #include <sysdep/ptrace.h>
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  10  
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  11  extern int userspace_pid[];
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  12  
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16 @13  extern int user_thread(unsigned long stack, int flags);
3c91735099 arch/um/include/skas/skas.h        Jeff Dike      2006-09-27  14  extern void new_thread_handler(void);
77bf440031 arch/um/include/skas/skas.h        Jeff Dike      2007-10-16  15  extern void handle_syscall(struct uml_pt_regs *regs);
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  16  extern long execute_syscall_skas(void *r);

:::::: The code at line 13 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds-gWtpgVMusWVb5UGfqNBoRg@public.gmane.org>
:::::: CC: Linus Torvalds <torvalds-gWtpgVMusWVb5UGfqNBoRg@public.gmane.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 7112 bytes --]

[-- Attachment #3: Type: text/plain, Size: 205 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/3] Run dump pipe in container's namespace
@ 2016-03-18 14:06         ` kbuild test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2016-03-18 14:06 UTC (permalink / raw)
  To: Zhao Lei
  Cc: kbuild-all, linux-kernel, containers, Eric W. Biederman,
	Mateusz Guzik, Zhao Lei

[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]

Hi Zhao,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.5 next-20160318]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Zhao-Lei/Make-core_pattern-support-namespace/20160318-205506
config: um-x86_64_defconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=um SUBARCH=x86_64

All errors (new ones prefixed by >>):

   kernel/time/Kconfig:155:warning: range is invalid
   In file included from arch/x86/um/asm/elf.h:9:0,
                    from include/linux/elf.h:4,
                    from arch/x86/um/shared/sysdep/kernel-offsets.h:3,
                    from arch/um/kernel/asm-offsets.c:1:
>> arch/um/include/shared/skas/skas.h:13:12: error: conflicting types for 'user_thread'
    extern int user_thread(unsigned long stack, int flags);
               ^
   In file included from arch/x86/um/shared/sysdep/kernel-offsets.h:2:0,
                    from arch/um/kernel/asm-offsets.c:1:
   include/linux/sched.h:2652:14: note: previous declaration of 'user_thread' was here
    extern pid_t user_thread(int (*fn)(void *), void *arg, unsigned long flags);
                 ^
   make[2]: *** [arch/um/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/user_thread +13 arch/um/include/shared/skas/skas.h

^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16   7  #define __SKAS_H
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16   8  
37185b3324 arch/um/include/shared/skas/skas.h Al Viro        2012-10-08   9  #include <sysdep/ptrace.h>
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  10  
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  11  extern int userspace_pid[];
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  12  
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16 @13  extern int user_thread(unsigned long stack, int flags);
3c91735099 arch/um/include/skas/skas.h        Jeff Dike      2006-09-27  14  extern void new_thread_handler(void);
77bf440031 arch/um/include/skas/skas.h        Jeff Dike      2007-10-16  15  extern void handle_syscall(struct uml_pt_regs *regs);
^1da177e4c arch/um/kernel/skas/include/skas.h Linus Torvalds 2005-04-16  16  extern long execute_syscall_skas(void *r);

:::::: The code at line 13 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 7112 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-18 12:48     ` Zhao Lei
@ 2016-03-21  6:00         ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21  6:00 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> Currently, each container shared one copy of coredump setting
> with the host system, if host system changed the setting, each
> running containers will be affected.
>
> Moreover, it is not easy to let each container keeping their own
> coredump setting.
>
> We can use some workaround as pipe program to make the second
> requirement possible, but it is not simple, and both host and
> container are limited to set to fixed pipe program.
> In one word, for host running contailer, we can't change core_pattern
> anymore.
> To make the problem more hard, if a host running more than one
> container product, each product will try to snatch the global
> coredump setting to fit their own requirement.
>
> For container based on namespace design, it is good to allow
> each container keeping their own coredump setting.
>
> It will bring us following benefit:
> 1: Each container can change their own coredump setting
>    based on operation on /proc/sys/kernel/core_pattern
> 2: Coredump setting changed in host will not affect
>    running containers.
> 3: Support both case of "putting coredump in guest" and
>    "putting curedump in host".
>
> Each namespace-based software(lxc, docker, ..) can use this function
> to custom their dump setting.
>
> And this function makes each continer working as separate system,
> it fit for design goal of namespace

There are a lot of questionable things with this patchset.

> @@ -183,7 +182,7 @@ put_exe_file:
>  static int format_corename(struct core_name *cn, struct coredump_params *cprm)
>  {
>  	const struct cred *cred = current_cred();
> -	const char *pat_ptr = core_pattern;
> +	const char *pat_ptr = current->nsproxy->pid_ns_for_children->core_pattern;

current->nsproxy->pid_ns_for_children as the name implies is completely
inappropriate for getting the pid namespace of the current task.

This should use task_active_pid_namespace.

>  	int ispipe = (*pat_ptr == '|');
>  	int pid_in_pattern = 0;
>  	int err = 0;
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 918b117..a5af1e9 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,6 +9,7 @@
>  #include <linux/nsproxy.h>
>  #include <linux/kref.h>
>  #include <linux/ns_common.h>
> +#include <linux/binfmts.h>
>  
>  struct pidmap {
>         atomic_t nr_free;
> @@ -45,6 +46,7 @@ struct pid_namespace {
>  	int hide_pid;
>  	int reboot;	/* group exit code if this pidns was rebooted */
>  	struct ns_common ns;
> +	char core_pattern[CORENAME_MAX_SIZE];
>  };
>  
>  extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 4d73a83..c79c1d5 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
>  #ifdef CONFIG_PID_NS
>  	.ns.ops = &pidns_operations,
>  #endif
> +	.core_pattern = "core",
>  };
>  EXPORT_SYMBOL_GPL(init_pid_ns);
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a65ba13..16d6d21 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -123,6 +123,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	for (i = 1; i < PIDMAP_ENTRIES; i++)
>  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>  
> +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> +		sizeof(ns->core_pattern));
> +

This is pretty horrible.  You are giving unprivileged processes the
ability to run an already specified core dump helper in a pid namespace
of their choosing.

That is not backwards compatible, and it is possible this can lead to
privilege escalation by triciking a privileged dump process to do
something silly because it is running in the wrong pid namespace.

Similarly the entire concept of forking from the program dumping core
suffers from the same problem but for all other namespaces.

I was hoping that I would see a justification somewhere in the patch
descriptions describing why this set of decisions could be safe.  I do
not and so I assume this case was not considered.

If you had managed to fork for the child_reaper of the pid_namespace
that set the core pattern (as has been suggested) there would be some
chance that things would work correctly.    As you are forking from the
program actually dumping core I see no chance that this patchset is
either safe or backwards compatible as currently written.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-21  6:00         ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21  6:00 UTC (permalink / raw)
  To: Zhao Lei; +Cc: linux-kernel, containers, Mateusz Guzik

Zhao Lei <zhaolei@cn.fujitsu.com> writes:

> Currently, each container shared one copy of coredump setting
> with the host system, if host system changed the setting, each
> running containers will be affected.
>
> Moreover, it is not easy to let each container keeping their own
> coredump setting.
>
> We can use some workaround as pipe program to make the second
> requirement possible, but it is not simple, and both host and
> container are limited to set to fixed pipe program.
> In one word, for host running contailer, we can't change core_pattern
> anymore.
> To make the problem more hard, if a host running more than one
> container product, each product will try to snatch the global
> coredump setting to fit their own requirement.
>
> For container based on namespace design, it is good to allow
> each container keeping their own coredump setting.
>
> It will bring us following benefit:
> 1: Each container can change their own coredump setting
>    based on operation on /proc/sys/kernel/core_pattern
> 2: Coredump setting changed in host will not affect
>    running containers.
> 3: Support both case of "putting coredump in guest" and
>    "putting curedump in host".
>
> Each namespace-based software(lxc, docker, ..) can use this function
> to custom their dump setting.
>
> And this function makes each continer working as separate system,
> it fit for design goal of namespace

There are a lot of questionable things with this patchset.

> @@ -183,7 +182,7 @@ put_exe_file:
>  static int format_corename(struct core_name *cn, struct coredump_params *cprm)
>  {
>  	const struct cred *cred = current_cred();
> -	const char *pat_ptr = core_pattern;
> +	const char *pat_ptr = current->nsproxy->pid_ns_for_children->core_pattern;

current->nsproxy->pid_ns_for_children as the name implies is completely
inappropriate for getting the pid namespace of the current task.

This should use task_active_pid_namespace.

>  	int ispipe = (*pat_ptr == '|');
>  	int pid_in_pattern = 0;
>  	int err = 0;
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 918b117..a5af1e9 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,6 +9,7 @@
>  #include <linux/nsproxy.h>
>  #include <linux/kref.h>
>  #include <linux/ns_common.h>
> +#include <linux/binfmts.h>
>  
>  struct pidmap {
>         atomic_t nr_free;
> @@ -45,6 +46,7 @@ struct pid_namespace {
>  	int hide_pid;
>  	int reboot;	/* group exit code if this pidns was rebooted */
>  	struct ns_common ns;
> +	char core_pattern[CORENAME_MAX_SIZE];
>  };
>  
>  extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 4d73a83..c79c1d5 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
>  #ifdef CONFIG_PID_NS
>  	.ns.ops = &pidns_operations,
>  #endif
> +	.core_pattern = "core",
>  };
>  EXPORT_SYMBOL_GPL(init_pid_ns);
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a65ba13..16d6d21 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -123,6 +123,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	for (i = 1; i < PIDMAP_ENTRIES; i++)
>  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>  
> +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> +		sizeof(ns->core_pattern));
> +

This is pretty horrible.  You are giving unprivileged processes the
ability to run an already specified core dump helper in a pid namespace
of their choosing.

That is not backwards compatible, and it is possible this can lead to
privilege escalation by triciking a privileged dump process to do
something silly because it is running in the wrong pid namespace.

Similarly the entire concept of forking from the program dumping core
suffers from the same problem but for all other namespaces.

I was hoping that I would see a justification somewhere in the patch
descriptions describing why this set of decisions could be safe.  I do
not and so I assume this case was not considered.

If you had managed to fork for the child_reaper of the pid_namespace
that set the core pattern (as has been suggested) there would be some
chance that things would work correctly.    As you are forking from the
program actually dumping core I see no chance that this patchset is
either safe or backwards compatible as currently written.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-21  6:00         ` Eric W. Biederman
@ 2016-03-21  7:16             ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-21  7:16 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi, Eric W. Biederman

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> Sent: Monday, March 21, 2016 2:00 PM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> Mateusz Guzik <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> 
> > Currently, each container shared one copy of coredump setting
> > with the host system, if host system changed the setting, each
> > running containers will be affected.
> >
> > Moreover, it is not easy to let each container keeping their own
> > coredump setting.
> >
> > We can use some workaround as pipe program to make the second
> > requirement possible, but it is not simple, and both host and
> > container are limited to set to fixed pipe program.
> > In one word, for host running contailer, we can't change core_pattern
> > anymore.
> > To make the problem more hard, if a host running more than one
> > container product, each product will try to snatch the global
> > coredump setting to fit their own requirement.
> >
> > For container based on namespace design, it is good to allow
> > each container keeping their own coredump setting.
> >
> > It will bring us following benefit:
> > 1: Each container can change their own coredump setting
> >    based on operation on /proc/sys/kernel/core_pattern
> > 2: Coredump setting changed in host will not affect
> >    running containers.
> > 3: Support both case of "putting coredump in guest" and
> >    "putting curedump in host".
> >
> > Each namespace-based software(lxc, docker, ..) can use this function
> > to custom their dump setting.
> >
> > And this function makes each continer working as separate system,
> > it fit for design goal of namespace
> 
> There are a lot of questionable things with this patchset.
> 
> > @@ -183,7 +182,7 @@ put_exe_file:
> >  static int format_corename(struct core_name *cn, struct
> coredump_params *cprm)
> >  {
> >  	const struct cred *cred = current_cred();
> > -	const char *pat_ptr = core_pattern;
> > +	const char *pat_ptr =
> current->nsproxy->pid_ns_for_children->core_pattern;
> 
> current->nsproxy->pid_ns_for_children as the name implies is completely
> inappropriate for getting the pid namespace of the current task.
> 
> This should use task_active_pid_namespace.
> 
In 5 members in nsproxy struct, pid_ns_for_children seems the best place
for this variable.

And no variable named task_active_pid_namespace in source,
could you explain it deeply?

> >  	int ispipe = (*pat_ptr == '|');
> >  	int pid_in_pattern = 0;
> >  	int err = 0;
> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> > index 918b117..a5af1e9 100644
> > --- a/include/linux/pid_namespace.h
> > +++ b/include/linux/pid_namespace.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/nsproxy.h>
> >  #include <linux/kref.h>
> >  #include <linux/ns_common.h>
> > +#include <linux/binfmts.h>
> >
> >  struct pidmap {
> >         atomic_t nr_free;
> > @@ -45,6 +46,7 @@ struct pid_namespace {
> >  	int hide_pid;
> >  	int reboot;	/* group exit code if this pidns was rebooted */
> >  	struct ns_common ns;
> > +	char core_pattern[CORENAME_MAX_SIZE];
> >  };
> >
> >  extern struct pid_namespace init_pid_ns;
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 4d73a83..c79c1d5 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
> >  #ifdef CONFIG_PID_NS
> >  	.ns.ops = &pidns_operations,
> >  #endif
> > +	.core_pattern = "core",
> >  };
> >  EXPORT_SYMBOL_GPL(init_pid_ns);
> >
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index a65ba13..16d6d21 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -123,6 +123,9 @@ static struct pid_namespace
> *create_pid_namespace(struct user_namespace *user_ns
> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
> >
> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> > +		sizeof(ns->core_pattern));
> > +
> 
> This is pretty horrible.  You are giving unprivileged processes the
> ability to run an already specified core dump helper in a pid namespace
> of their choosing.
> 
Similar problem before patch.
In piped core_pattern setting, any panic process will trigger a
running of core_dump process.

Comparing to current code, current code maybe more horrible,
the guest can destroy host system, and after this patch, the guest
can only destroy itself.
(As the script in patch description)

Actually it is not so horrible, only the root user can modify code_pattern,
and normal user/process have no chance to do bad thing.

> That is not backwards compatible, and it is possible this can lead to
> privilege escalation by triciking a privileged dump process to do
> something silly because it is running in the wrong pid namespace.
> 
In current code, the dump process is forking from kernel thread,
it is in a most-privileged namespace, dumping contents into host's fs,
it really cause problem.
Compare to current code, running dump process in container's
namespace maybe the right way.

The only thing this patch do is letting dump program running in
container's namespace instead of host.

> Similarly the entire concept of forking from the program dumping core
> suffers from the same problem but for all other namespaces.
> 
> I was hoping that I would see a justification somewhere in the patch
> descriptions describing why this set of decisions could be safe.  I do
> not and so I assume this case was not considered.
> 
> If you had managed to fork for the child_reaper of the pid_namespace
> that set the core pattern (as has been suggested) there would be some
> chance that things would work correctly.
Do you mean do fork in kthread(who is running in host's namespace, as corrent code)
with some special operation to change new thread running in container's
namespace?

> As you are forking from the program actually dumping core I see no
> chance that this patchset is either safe or backwards compatible as
> currently written.
> 
Current code have obvious problem, this forking new thread in container's
namespace is nothing but safe than host's namespace.
At least we need to solve the problem descripted in script in patch
description.

The only thing is backwards compatible, as our discussion in v1 patch,
it is the thing we need to change.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-21  7:16             ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-21  7:16 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Hi, Eric W. Biederman

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Monday, March 21, 2016 2:00 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> Mateusz Guzik <mguzik@redhat.com>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> 
> > Currently, each container shared one copy of coredump setting
> > with the host system, if host system changed the setting, each
> > running containers will be affected.
> >
> > Moreover, it is not easy to let each container keeping their own
> > coredump setting.
> >
> > We can use some workaround as pipe program to make the second
> > requirement possible, but it is not simple, and both host and
> > container are limited to set to fixed pipe program.
> > In one word, for host running contailer, we can't change core_pattern
> > anymore.
> > To make the problem more hard, if a host running more than one
> > container product, each product will try to snatch the global
> > coredump setting to fit their own requirement.
> >
> > For container based on namespace design, it is good to allow
> > each container keeping their own coredump setting.
> >
> > It will bring us following benefit:
> > 1: Each container can change their own coredump setting
> >    based on operation on /proc/sys/kernel/core_pattern
> > 2: Coredump setting changed in host will not affect
> >    running containers.
> > 3: Support both case of "putting coredump in guest" and
> >    "putting curedump in host".
> >
> > Each namespace-based software(lxc, docker, ..) can use this function
> > to custom their dump setting.
> >
> > And this function makes each continer working as separate system,
> > it fit for design goal of namespace
> 
> There are a lot of questionable things with this patchset.
> 
> > @@ -183,7 +182,7 @@ put_exe_file:
> >  static int format_corename(struct core_name *cn, struct
> coredump_params *cprm)
> >  {
> >  	const struct cred *cred = current_cred();
> > -	const char *pat_ptr = core_pattern;
> > +	const char *pat_ptr =
> current->nsproxy->pid_ns_for_children->core_pattern;
> 
> current->nsproxy->pid_ns_for_children as the name implies is completely
> inappropriate for getting the pid namespace of the current task.
> 
> This should use task_active_pid_namespace.
> 
In 5 members in nsproxy struct, pid_ns_for_children seems the best place
for this variable.

And no variable named task_active_pid_namespace in source,
could you explain it deeply?

> >  	int ispipe = (*pat_ptr == '|');
> >  	int pid_in_pattern = 0;
> >  	int err = 0;
> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> > index 918b117..a5af1e9 100644
> > --- a/include/linux/pid_namespace.h
> > +++ b/include/linux/pid_namespace.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/nsproxy.h>
> >  #include <linux/kref.h>
> >  #include <linux/ns_common.h>
> > +#include <linux/binfmts.h>
> >
> >  struct pidmap {
> >         atomic_t nr_free;
> > @@ -45,6 +46,7 @@ struct pid_namespace {
> >  	int hide_pid;
> >  	int reboot;	/* group exit code if this pidns was rebooted */
> >  	struct ns_common ns;
> > +	char core_pattern[CORENAME_MAX_SIZE];
> >  };
> >
> >  extern struct pid_namespace init_pid_ns;
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 4d73a83..c79c1d5 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
> >  #ifdef CONFIG_PID_NS
> >  	.ns.ops = &pidns_operations,
> >  #endif
> > +	.core_pattern = "core",
> >  };
> >  EXPORT_SYMBOL_GPL(init_pid_ns);
> >
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index a65ba13..16d6d21 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -123,6 +123,9 @@ static struct pid_namespace
> *create_pid_namespace(struct user_namespace *user_ns
> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
> >
> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> > +		sizeof(ns->core_pattern));
> > +
> 
> This is pretty horrible.  You are giving unprivileged processes the
> ability to run an already specified core dump helper in a pid namespace
> of their choosing.
> 
Similar problem before patch.
In piped core_pattern setting, any panic process will trigger a
running of core_dump process.

Comparing to current code, current code maybe more horrible,
the guest can destroy host system, and after this patch, the guest
can only destroy itself.
(As the script in patch description)

Actually it is not so horrible, only the root user can modify code_pattern,
and normal user/process have no chance to do bad thing.

> That is not backwards compatible, and it is possible this can lead to
> privilege escalation by triciking a privileged dump process to do
> something silly because it is running in the wrong pid namespace.
> 
In current code, the dump process is forking from kernel thread,
it is in a most-privileged namespace, dumping contents into host's fs,
it really cause problem.
Compare to current code, running dump process in container's
namespace maybe the right way.

The only thing this patch do is letting dump program running in
container's namespace instead of host.

> Similarly the entire concept of forking from the program dumping core
> suffers from the same problem but for all other namespaces.
> 
> I was hoping that I would see a justification somewhere in the patch
> descriptions describing why this set of decisions could be safe.  I do
> not and so I assume this case was not considered.
> 
> If you had managed to fork for the child_reaper of the pid_namespace
> that set the core pattern (as has been suggested) there would be some
> chance that things would work correctly.
Do you mean do fork in kthread(who is running in host's namespace, as corrent code)
with some special operation to change new thread running in container's
namespace?

> As you are forking from the program actually dumping core I see no
> chance that this patchset is either safe or backwards compatible as
> currently written.
> 
Current code have obvious problem, this forking new thread in container's
namespace is nothing but safe than host's namespace.
At least we need to solve the problem descripted in script in patch
description.

The only thing is backwards compatible, as our discussion in v1 patch,
it is the thing we need to change.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-21  7:16             ` Zhao Lei
@ 2016-03-21  8:14               ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21  8:14 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> Hi, Eric W. Biederman
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> Sent: Monday, March 21, 2016 2:00 PM
>> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
>> Mateusz Guzik <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> 
>> > Currently, each container shared one copy of coredump setting
>> > with the host system, if host system changed the setting, each
>> > running containers will be affected.
>> >
>> > Moreover, it is not easy to let each container keeping their own
>> > coredump setting.
>> >
>> > We can use some workaround as pipe program to make the second
>> > requirement possible, but it is not simple, and both host and
>> > container are limited to set to fixed pipe program.
>> > In one word, for host running contailer, we can't change core_pattern
>> > anymore.
>> > To make the problem more hard, if a host running more than one
>> > container product, each product will try to snatch the global
>> > coredump setting to fit their own requirement.
>> >
>> > For container based on namespace design, it is good to allow
>> > each container keeping their own coredump setting.
>> >
>> > It will bring us following benefit:
>> > 1: Each container can change their own coredump setting
>> >    based on operation on /proc/sys/kernel/core_pattern
>> > 2: Coredump setting changed in host will not affect
>> >    running containers.
>> > 3: Support both case of "putting coredump in guest" and
>> >    "putting curedump in host".
>> >
>> > Each namespace-based software(lxc, docker, ..) can use this function
>> > to custom their dump setting.
>> >
>> > And this function makes each continer working as separate system,
>> > it fit for design goal of namespace
>> 
>> There are a lot of questionable things with this patchset.
>> 
>> > @@ -183,7 +182,7 @@ put_exe_file:
>> >  static int format_corename(struct core_name *cn, struct
>> coredump_params *cprm)
>> >  {
>> >  	const struct cred *cred = current_cred();
>> > -	const char *pat_ptr = core_pattern;
>> > +	const char *pat_ptr =
>> current->nsproxy->pid_ns_for_children->core_pattern;
>> 
>> current->nsproxy->pid_ns_for_children as the name implies is completely
>> inappropriate for getting the pid namespace of the current task.
>> 
>> This should use task_active_pid_namespace.
>> 
> In 5 members in nsproxy struct, pid_ns_for_children seems the best place
> for this variable.

nsproxy is not a magic list of namespaces, nsproxy is to keep
task_struct from expanding and as a trick to keep reference count
increments for namespaces cheap.

> And no variable named task_active_pid_namespace in source,
> could you explain it deeply?

Apologies I mispelled it.  Look in pid_namespace.h at
task_active_pid_ns.  If you want a tasks pid namespace that is the
function to use.

pid_ns_for_children only describes newly forked children.  Which leads
to another problem of your patchset.  I can force your coredump helper
into a pid namespace that the program that dumps core will create it's
children in.

>> >  	int ispipe = (*pat_ptr == '|');
>> >  	int pid_in_pattern = 0;
>> >  	int err = 0;
>> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> > index 918b117..a5af1e9 100644
>> > --- a/include/linux/pid_namespace.h
>> > +++ b/include/linux/pid_namespace.h
>> > @@ -9,6 +9,7 @@
>> >  #include <linux/nsproxy.h>
>> >  #include <linux/kref.h>
>> >  #include <linux/ns_common.h>
>> > +#include <linux/binfmts.h>
>> >
>> >  struct pidmap {
>> >         atomic_t nr_free;
>> > @@ -45,6 +46,7 @@ struct pid_namespace {
>> >  	int hide_pid;
>> >  	int reboot;	/* group exit code if this pidns was rebooted */
>> >  	struct ns_common ns;
>> > +	char core_pattern[CORENAME_MAX_SIZE];
>> >  };
>> >
>> >  extern struct pid_namespace init_pid_ns;
>> > diff --git a/kernel/pid.c b/kernel/pid.c
>> > index 4d73a83..c79c1d5 100644
>> > --- a/kernel/pid.c
>> > +++ b/kernel/pid.c
>> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
>> >  #ifdef CONFIG_PID_NS
>> >  	.ns.ops = &pidns_operations,
>> >  #endif
>> > +	.core_pattern = "core",
>> >  };
>> >  EXPORT_SYMBOL_GPL(init_pid_ns);
>> >
>> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> > index a65ba13..16d6d21 100644
>> > --- a/kernel/pid_namespace.c
>> > +++ b/kernel/pid_namespace.c
>> > @@ -123,6 +123,9 @@ static struct pid_namespace
>> *create_pid_namespace(struct user_namespace *user_ns
>> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
>> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>> >
>> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
>> > +		sizeof(ns->core_pattern));
>> > +
>> 
>> This is pretty horrible.  You are giving unprivileged processes the
>> ability to run an already specified core dump helper in a pid namespace
>> of their choosing.
>> 
> Similar problem before patch.
> In piped core_pattern setting, any panic process will trigger a
> running of core_dump process.

That is not at all alike.

> Comparing to current code, current code maybe more horrible,
> the guest can destroy host system, and after this patch, the guest
> can only destroy itself.
> (As the script in patch description)

Only if the host is configured to allow itself to be stomped on.  That
is completely a host configured setting.  Yes the host can configure
itself in a way that can cause problems.  But that is the hosts problem.

> Actually it is not so horrible, only the root user can modify code_pattern,
> and normal user/process have no chance to do bad thing.

The argument that a bad design is not bad because only root can do X
does not fly anymore, especially in the presence of containers.

>> That is not backwards compatible, and it is possible this can lead to
>> privilege escalation by triciking a privileged dump process to do
>> something silly because it is running in the wrong pid namespace.
>> 
> In current code, the dump process is forking from kernel thread,
> it is in a most-privileged namespace, dumping contents into host's fs,
> it really cause problem.

It only dumps into the host's fs if it is configured that way.  When you
point a gun at your foot and pull the trigger that really causes
problems as well.  That doesn't make it the gun's problem that it can be
pointed at feet.

> Compare to current code, running dump process in container's
> namespace maybe the right way.
>
> The only thing this patch do is letting dump program running in
> container's namespace instead of host.

The ability to trick a more privileged program to do the wrong thing is
most definitely more than only letting the dump program run in the
container's namespace.  This can be everything up to including getting
root outside of the container.

I will admit that if used with a user namespace what you are doing is
not the worst possible thing to do but it is very definitely a mess.

>> Similarly the entire concept of forking from the program dumping core
>> suffers from the same problem but for all other namespaces.
>> 
>> I was hoping that I would see a justification somewhere in the patch
>> descriptions describing why this set of decisions could be safe.  I do
>> not and so I assume this case was not considered.
>> 
>> If you had managed to fork for the child_reaper of the pid_namespace
>> that set the core pattern (as has been suggested) there would be some
>> chance that things would work correctly.
> Do you mean do fork in kthread(who is running in host's namespace, as corrent code)
> with some special operation to change new thread running in container's
> namespace?
>
>> As you are forking from the program actually dumping core I see no
>> chance that this patchset is either safe or backwards compatible as
>> currently written.
>> 
> Current code have obvious problem, this forking new thread in container's
> namespace is nothing but safe than host's namespace.
> At least we need to solve the problem descripted in script in patch
> description.

The current code has obvious limitations.  But you can in userspace
accomplish mush everything you hope to accomplish here, as all of the
information is available.  It just requires coopeartion.

> The only thing is backwards compatible, as our discussion in v1 patch,
> it is the thing we need to change.

*Laughs*

This code is so absurd in handling the weird cases that I was hoping
that someone else would point out how very very bad it is.  The possible
solutions to the problems have already been discussed to ad nasium and
you have not used any of those solutions.  Although it does seem the
kbuild test robot was not too bad in point out how poor your testing of
these patches was.

At the end of the day I can break any number of current setups with your
patches.  Then there are the security implication of confusing
privileged or somewhat privileged dumping programs.   On the one hand
your patches are not giving the core dumping program enough privielges
to write core dumps, on the other hand you are making it possible to
confuse at least set uid root core dump helpers, leading to privilege
escalations.

In no case does what happens during a core dump follow any version of
the principle of least surprise.


I will agree that the problem you are trying to solve is a pain point,
but as I have said before it is a pain point because it is tricky to get
all of the details right for an real solution.


Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-21  8:14               ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21  8:14 UTC (permalink / raw)
  To: Zhao Lei
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Zhao Lei <zhaolei@cn.fujitsu.com> writes:

> Hi, Eric W. Biederman
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> Sent: Monday, March 21, 2016 2:00 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
>> Mateusz Guzik <mguzik@redhat.com>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
>> 
>> > Currently, each container shared one copy of coredump setting
>> > with the host system, if host system changed the setting, each
>> > running containers will be affected.
>> >
>> > Moreover, it is not easy to let each container keeping their own
>> > coredump setting.
>> >
>> > We can use some workaround as pipe program to make the second
>> > requirement possible, but it is not simple, and both host and
>> > container are limited to set to fixed pipe program.
>> > In one word, for host running contailer, we can't change core_pattern
>> > anymore.
>> > To make the problem more hard, if a host running more than one
>> > container product, each product will try to snatch the global
>> > coredump setting to fit their own requirement.
>> >
>> > For container based on namespace design, it is good to allow
>> > each container keeping their own coredump setting.
>> >
>> > It will bring us following benefit:
>> > 1: Each container can change their own coredump setting
>> >    based on operation on /proc/sys/kernel/core_pattern
>> > 2: Coredump setting changed in host will not affect
>> >    running containers.
>> > 3: Support both case of "putting coredump in guest" and
>> >    "putting curedump in host".
>> >
>> > Each namespace-based software(lxc, docker, ..) can use this function
>> > to custom their dump setting.
>> >
>> > And this function makes each continer working as separate system,
>> > it fit for design goal of namespace
>> 
>> There are a lot of questionable things with this patchset.
>> 
>> > @@ -183,7 +182,7 @@ put_exe_file:
>> >  static int format_corename(struct core_name *cn, struct
>> coredump_params *cprm)
>> >  {
>> >  	const struct cred *cred = current_cred();
>> > -	const char *pat_ptr = core_pattern;
>> > +	const char *pat_ptr =
>> current->nsproxy->pid_ns_for_children->core_pattern;
>> 
>> current->nsproxy->pid_ns_for_children as the name implies is completely
>> inappropriate for getting the pid namespace of the current task.
>> 
>> This should use task_active_pid_namespace.
>> 
> In 5 members in nsproxy struct, pid_ns_for_children seems the best place
> for this variable.

nsproxy is not a magic list of namespaces, nsproxy is to keep
task_struct from expanding and as a trick to keep reference count
increments for namespaces cheap.

> And no variable named task_active_pid_namespace in source,
> could you explain it deeply?

Apologies I mispelled it.  Look in pid_namespace.h at
task_active_pid_ns.  If you want a tasks pid namespace that is the
function to use.

pid_ns_for_children only describes newly forked children.  Which leads
to another problem of your patchset.  I can force your coredump helper
into a pid namespace that the program that dumps core will create it's
children in.

>> >  	int ispipe = (*pat_ptr == '|');
>> >  	int pid_in_pattern = 0;
>> >  	int err = 0;
>> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> > index 918b117..a5af1e9 100644
>> > --- a/include/linux/pid_namespace.h
>> > +++ b/include/linux/pid_namespace.h
>> > @@ -9,6 +9,7 @@
>> >  #include <linux/nsproxy.h>
>> >  #include <linux/kref.h>
>> >  #include <linux/ns_common.h>
>> > +#include <linux/binfmts.h>
>> >
>> >  struct pidmap {
>> >         atomic_t nr_free;
>> > @@ -45,6 +46,7 @@ struct pid_namespace {
>> >  	int hide_pid;
>> >  	int reboot;	/* group exit code if this pidns was rebooted */
>> >  	struct ns_common ns;
>> > +	char core_pattern[CORENAME_MAX_SIZE];
>> >  };
>> >
>> >  extern struct pid_namespace init_pid_ns;
>> > diff --git a/kernel/pid.c b/kernel/pid.c
>> > index 4d73a83..c79c1d5 100644
>> > --- a/kernel/pid.c
>> > +++ b/kernel/pid.c
>> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
>> >  #ifdef CONFIG_PID_NS
>> >  	.ns.ops = &pidns_operations,
>> >  #endif
>> > +	.core_pattern = "core",
>> >  };
>> >  EXPORT_SYMBOL_GPL(init_pid_ns);
>> >
>> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> > index a65ba13..16d6d21 100644
>> > --- a/kernel/pid_namespace.c
>> > +++ b/kernel/pid_namespace.c
>> > @@ -123,6 +123,9 @@ static struct pid_namespace
>> *create_pid_namespace(struct user_namespace *user_ns
>> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
>> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
>> >
>> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
>> > +		sizeof(ns->core_pattern));
>> > +
>> 
>> This is pretty horrible.  You are giving unprivileged processes the
>> ability to run an already specified core dump helper in a pid namespace
>> of their choosing.
>> 
> Similar problem before patch.
> In piped core_pattern setting, any panic process will trigger a
> running of core_dump process.

That is not at all alike.

> Comparing to current code, current code maybe more horrible,
> the guest can destroy host system, and after this patch, the guest
> can only destroy itself.
> (As the script in patch description)

Only if the host is configured to allow itself to be stomped on.  That
is completely a host configured setting.  Yes the host can configure
itself in a way that can cause problems.  But that is the hosts problem.

> Actually it is not so horrible, only the root user can modify code_pattern,
> and normal user/process have no chance to do bad thing.

The argument that a bad design is not bad because only root can do X
does not fly anymore, especially in the presence of containers.

>> That is not backwards compatible, and it is possible this can lead to
>> privilege escalation by triciking a privileged dump process to do
>> something silly because it is running in the wrong pid namespace.
>> 
> In current code, the dump process is forking from kernel thread,
> it is in a most-privileged namespace, dumping contents into host's fs,
> it really cause problem.

It only dumps into the host's fs if it is configured that way.  When you
point a gun at your foot and pull the trigger that really causes
problems as well.  That doesn't make it the gun's problem that it can be
pointed at feet.

> Compare to current code, running dump process in container's
> namespace maybe the right way.
>
> The only thing this patch do is letting dump program running in
> container's namespace instead of host.

The ability to trick a more privileged program to do the wrong thing is
most definitely more than only letting the dump program run in the
container's namespace.  This can be everything up to including getting
root outside of the container.

I will admit that if used with a user namespace what you are doing is
not the worst possible thing to do but it is very definitely a mess.

>> Similarly the entire concept of forking from the program dumping core
>> suffers from the same problem but for all other namespaces.
>> 
>> I was hoping that I would see a justification somewhere in the patch
>> descriptions describing why this set of decisions could be safe.  I do
>> not and so I assume this case was not considered.
>> 
>> If you had managed to fork for the child_reaper of the pid_namespace
>> that set the core pattern (as has been suggested) there would be some
>> chance that things would work correctly.
> Do you mean do fork in kthread(who is running in host's namespace, as corrent code)
> with some special operation to change new thread running in container's
> namespace?
>
>> As you are forking from the program actually dumping core I see no
>> chance that this patchset is either safe or backwards compatible as
>> currently written.
>> 
> Current code have obvious problem, this forking new thread in container's
> namespace is nothing but safe than host's namespace.
> At least we need to solve the problem descripted in script in patch
> description.

The current code has obvious limitations.  But you can in userspace
accomplish mush everything you hope to accomplish here, as all of the
information is available.  It just requires coopeartion.

> The only thing is backwards compatible, as our discussion in v1 patch,
> it is the thing we need to change.

*Laughs*

This code is so absurd in handling the weird cases that I was hoping
that someone else would point out how very very bad it is.  The possible
solutions to the problems have already been discussed to ad nasium and
you have not used any of those solutions.  Although it does seem the
kbuild test robot was not too bad in point out how poor your testing of
these patches was.

At the end of the day I can break any number of current setups with your
patches.  Then there are the security implication of confusing
privileged or somewhat privileged dumping programs.   On the one hand
your patches are not giving the core dumping program enough privielges
to write core dumps, on the other hand you are making it possible to
confuse at least set uid root core dump helpers, leading to privilege
escalations.

In no case does what happens during a core dump follow any version of
the principle of least surprise.


I will agree that the problem you are trying to solve is a pain point,
but as I have said before it is a pain point because it is tricky to get
all of the details right for an real solution.


Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-21  8:14               ` Eric W. Biederman
@ 2016-03-21 10:09                   ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-21 10:09 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> Sent: Monday, March 21, 2016 4:15 PM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> 
> > Hi, Eric W. Biederman
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> >> Sent: Monday, March 21, 2016 2:00 PM
> >> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> >> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> >> Mateusz Guzik <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> >>
> >> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> >>
> >> > Currently, each container shared one copy of coredump setting
> >> > with the host system, if host system changed the setting, each
> >> > running containers will be affected.
> >> >
> >> > Moreover, it is not easy to let each container keeping their own
> >> > coredump setting.
> >> >
> >> > We can use some workaround as pipe program to make the second
> >> > requirement possible, but it is not simple, and both host and
> >> > container are limited to set to fixed pipe program.
> >> > In one word, for host running contailer, we can't change core_pattern
> >> > anymore.
> >> > To make the problem more hard, if a host running more than one
> >> > container product, each product will try to snatch the global
> >> > coredump setting to fit their own requirement.
> >> >
> >> > For container based on namespace design, it is good to allow
> >> > each container keeping their own coredump setting.
> >> >
> >> > It will bring us following benefit:
> >> > 1: Each container can change their own coredump setting
> >> >    based on operation on /proc/sys/kernel/core_pattern
> >> > 2: Coredump setting changed in host will not affect
> >> >    running containers.
> >> > 3: Support both case of "putting coredump in guest" and
> >> >    "putting curedump in host".
> >> >
> >> > Each namespace-based software(lxc, docker, ..) can use this function
> >> > to custom their dump setting.
> >> >
> >> > And this function makes each continer working as separate system,
> >> > it fit for design goal of namespace
> >>
> >> There are a lot of questionable things with this patchset.
> >>
> >> > @@ -183,7 +182,7 @@ put_exe_file:
> >> >  static int format_corename(struct core_name *cn, struct
> >> coredump_params *cprm)
> >> >  {
> >> >  	const struct cred *cred = current_cred();
> >> > -	const char *pat_ptr = core_pattern;
> >> > +	const char *pat_ptr =
> >> current->nsproxy->pid_ns_for_children->core_pattern;
> >>
> >> current->nsproxy->pid_ns_for_children as the name implies is completely
> >> inappropriate for getting the pid namespace of the current task.
> >>
> >> This should use task_active_pid_namespace.
> >>
> > In 5 members in nsproxy struct, pid_ns_for_children seems the best place
> > for this variable.
> 
> nsproxy is not a magic list of namespaces, nsproxy is to keep
> task_struct from expanding and as a trick to keep reference count
> increments for namespaces cheap.
> 
> > And no variable named task_active_pid_namespace in source,
> > could you explain it deeply?
> 
> Apologies I mispelled it.  Look in pid_namespace.h at
> task_active_pid_ns.  If you want a tasks pid namespace that is the
> function to use.
> 
> pid_ns_for_children only describes newly forked children.  Which leads
> to another problem of your patchset.  I can force your coredump helper
> into a pid namespace that the program that dumps core will create it's
> children in.
> 
> >> >  	int ispipe = (*pat_ptr == '|');
> >> >  	int pid_in_pattern = 0;
> >> >  	int err = 0;
> >> > diff --git a/include/linux/pid_namespace.h
> b/include/linux/pid_namespace.h
> >> > index 918b117..a5af1e9 100644
> >> > --- a/include/linux/pid_namespace.h
> >> > +++ b/include/linux/pid_namespace.h
> >> > @@ -9,6 +9,7 @@
> >> >  #include <linux/nsproxy.h>
> >> >  #include <linux/kref.h>
> >> >  #include <linux/ns_common.h>
> >> > +#include <linux/binfmts.h>
> >> >
> >> >  struct pidmap {
> >> >         atomic_t nr_free;
> >> > @@ -45,6 +46,7 @@ struct pid_namespace {
> >> >  	int hide_pid;
> >> >  	int reboot;	/* group exit code if this pidns was rebooted */
> >> >  	struct ns_common ns;
> >> > +	char core_pattern[CORENAME_MAX_SIZE];
> >> >  };
> >> >
> >> >  extern struct pid_namespace init_pid_ns;
> >> > diff --git a/kernel/pid.c b/kernel/pid.c
> >> > index 4d73a83..c79c1d5 100644
> >> > --- a/kernel/pid.c
> >> > +++ b/kernel/pid.c
> >> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
> >> >  #ifdef CONFIG_PID_NS
> >> >  	.ns.ops = &pidns_operations,
> >> >  #endif
> >> > +	.core_pattern = "core",
> >> >  };
> >> >  EXPORT_SYMBOL_GPL(init_pid_ns);
> >> >
> >> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> >> > index a65ba13..16d6d21 100644
> >> > --- a/kernel/pid_namespace.c
> >> > +++ b/kernel/pid_namespace.c
> >> > @@ -123,6 +123,9 @@ static struct pid_namespace
> >> *create_pid_namespace(struct user_namespace *user_ns
> >> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
> >> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
> >> >
> >> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> >> > +		sizeof(ns->core_pattern));
> >> > +
> >>
> >> This is pretty horrible.  You are giving unprivileged processes the
> >> ability to run an already specified core dump helper in a pid namespace
> >> of their choosing.
> >>
> > Similar problem before patch.
> > In piped core_pattern setting, any panic process will trigger a
> > running of core_dump process.
> 
> That is not at all alike.
> 
> > Comparing to current code, current code maybe more horrible,
> > the guest can destroy host system, and after this patch, the guest
> > can only destroy itself.
> > (As the script in patch description)
> 
> Only if the host is configured to allow itself to be stomped on.  That
> is completely a host configured setting.  Yes the host can configure
> itself in a way that can cause problems.  But that is the hosts problem.
> 
> > Actually it is not so horrible, only the root user can modify code_pattern,
> > and normal user/process have no chance to do bad thing.
> 
> The argument that a bad design is not bad because only root can do X
> does not fly anymore, especially in the presence of containers.
> 
> >> That is not backwards compatible, and it is possible this can lead to
> >> privilege escalation by triciking a privileged dump process to do
> >> something silly because it is running in the wrong pid namespace.
> >>
> > In current code, the dump process is forking from kernel thread,
> > it is in a most-privileged namespace, dumping contents into host's fs,
> > it really cause problem.
> 
> It only dumps into the host's fs if it is configured that way.  When you
> point a gun at your foot and pull the trigger that really causes
> problems as well.  That doesn't make it the gun's problem that it can be
> pointed at feet.
> 
> > Compare to current code, running dump process in container's
> > namespace maybe the right way.
> >
> > The only thing this patch do is letting dump program running in
> > container's namespace instead of host.
> 
> The ability to trick a more privileged program to do the wrong thing is
> most definitely more than only letting the dump program run in the
> container's namespace.  This can be everything up to including getting
> root outside of the container.
> 
> I will admit that if used with a user namespace what you are doing is
> not the worst possible thing to do but it is very definitely a mess.
> 
> >> Similarly the entire concept of forking from the program dumping core
> >> suffers from the same problem but for all other namespaces.
> >>
> >> I was hoping that I would see a justification somewhere in the patch
> >> descriptions describing why this set of decisions could be safe.  I do
> >> not and so I assume this case was not considered.
> >>
> >> If you had managed to fork for the child_reaper of the pid_namespace
> >> that set the core pattern (as has been suggested) there would be some
> >> chance that things would work correctly.
> > Do you mean do fork in kthread(who is running in host's namespace, as
> corrent code)
> > with some special operation to change new thread running in container's
> > namespace?
> >
> >> As you are forking from the program actually dumping core I see no
> >> chance that this patchset is either safe or backwards compatible as
> >> currently written.
> >>
> > Current code have obvious problem, this forking new thread in container's
> > namespace is nothing but safe than host's namespace.
> > At least we need to solve the problem descripted in script in patch
> > description.
> 
> The current code has obvious limitations.  But you can in userspace
> accomplish mush everything you hope to accomplish here, as all of the
> information is available.  It just requires coopeartion.
> 
> > The only thing is backwards compatible, as our discussion in v1 patch,
> > it is the thing we need to change.
> 
> *Laughs*
> 
> This code is so absurd in handling the weird cases that I was hoping
> that someone else would point out how very very bad it is.  The possible
> solutions to the problems have already been discussed to ad nasium and
> you have not used any of those solutions.  Although it does seem the
> kbuild test robot was not too bad in point out how poor your testing of
> these patches was.
> 
> At the end of the day I can break any number of current setups with your
> patches.  Then there are the security implication of confusing
> privileged or somewhat privileged dumping programs.   On the one hand
> your patches are not giving the core dumping program enough privielges
> to write core dumps, on the other hand you are making it possible to
> confuse at least set uid root core dump helpers, leading to privilege
> escalations.
> 
> In no case does what happens during a core dump follow any version of
> the principle of least surprise.
> 
> 
> I will agree that the problem you are trying to solve is a pain point,
> but as I have said before it is a pain point because it is tricky to get
> all of the details right for an real solution.
> 
Let me make a summarize:
You think this way is not acceptable, because the pipe program is running
in the panic-process's namespace context.

And in my view, a pipe program in the host's top level namespace is also
a problem.

Let us think a container, to make it act as a real machine, when a program
panic, linux kernel should dump it into the container's filesystem.

For the kernel, to keep the current way of forking pipe program by kthread,
just let the pipe thread running in the container's namespace, instead the host,
may solve the problem in current kernel.

What is your opinion?

Btw, this patch is trying to solve the problem descripted in thread named:
"piping core dump to a program escapes container" in
http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.html
Maybe using a userspace tool can make container dump to anywhere,
but for kernel ifself, it is better to solve above problem if we can.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-21 10:09                   ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-21 10:09 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Monday, March 21, 2016 4:15 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu@jp.fujitsu.com>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> 
> > Hi, Eric W. Biederman
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> >> Sent: Monday, March 21, 2016 2:00 PM
> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> >> Mateusz Guzik <mguzik@redhat.com>
> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> >>
> >> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> >>
> >> > Currently, each container shared one copy of coredump setting
> >> > with the host system, if host system changed the setting, each
> >> > running containers will be affected.
> >> >
> >> > Moreover, it is not easy to let each container keeping their own
> >> > coredump setting.
> >> >
> >> > We can use some workaround as pipe program to make the second
> >> > requirement possible, but it is not simple, and both host and
> >> > container are limited to set to fixed pipe program.
> >> > In one word, for host running contailer, we can't change core_pattern
> >> > anymore.
> >> > To make the problem more hard, if a host running more than one
> >> > container product, each product will try to snatch the global
> >> > coredump setting to fit their own requirement.
> >> >
> >> > For container based on namespace design, it is good to allow
> >> > each container keeping their own coredump setting.
> >> >
> >> > It will bring us following benefit:
> >> > 1: Each container can change their own coredump setting
> >> >    based on operation on /proc/sys/kernel/core_pattern
> >> > 2: Coredump setting changed in host will not affect
> >> >    running containers.
> >> > 3: Support both case of "putting coredump in guest" and
> >> >    "putting curedump in host".
> >> >
> >> > Each namespace-based software(lxc, docker, ..) can use this function
> >> > to custom their dump setting.
> >> >
> >> > And this function makes each continer working as separate system,
> >> > it fit for design goal of namespace
> >>
> >> There are a lot of questionable things with this patchset.
> >>
> >> > @@ -183,7 +182,7 @@ put_exe_file:
> >> >  static int format_corename(struct core_name *cn, struct
> >> coredump_params *cprm)
> >> >  {
> >> >  	const struct cred *cred = current_cred();
> >> > -	const char *pat_ptr = core_pattern;
> >> > +	const char *pat_ptr =
> >> current->nsproxy->pid_ns_for_children->core_pattern;
> >>
> >> current->nsproxy->pid_ns_for_children as the name implies is completely
> >> inappropriate for getting the pid namespace of the current task.
> >>
> >> This should use task_active_pid_namespace.
> >>
> > In 5 members in nsproxy struct, pid_ns_for_children seems the best place
> > for this variable.
> 
> nsproxy is not a magic list of namespaces, nsproxy is to keep
> task_struct from expanding and as a trick to keep reference count
> increments for namespaces cheap.
> 
> > And no variable named task_active_pid_namespace in source,
> > could you explain it deeply?
> 
> Apologies I mispelled it.  Look in pid_namespace.h at
> task_active_pid_ns.  If you want a tasks pid namespace that is the
> function to use.
> 
> pid_ns_for_children only describes newly forked children.  Which leads
> to another problem of your patchset.  I can force your coredump helper
> into a pid namespace that the program that dumps core will create it's
> children in.
> 
> >> >  	int ispipe = (*pat_ptr == '|');
> >> >  	int pid_in_pattern = 0;
> >> >  	int err = 0;
> >> > diff --git a/include/linux/pid_namespace.h
> b/include/linux/pid_namespace.h
> >> > index 918b117..a5af1e9 100644
> >> > --- a/include/linux/pid_namespace.h
> >> > +++ b/include/linux/pid_namespace.h
> >> > @@ -9,6 +9,7 @@
> >> >  #include <linux/nsproxy.h>
> >> >  #include <linux/kref.h>
> >> >  #include <linux/ns_common.h>
> >> > +#include <linux/binfmts.h>
> >> >
> >> >  struct pidmap {
> >> >         atomic_t nr_free;
> >> > @@ -45,6 +46,7 @@ struct pid_namespace {
> >> >  	int hide_pid;
> >> >  	int reboot;	/* group exit code if this pidns was rebooted */
> >> >  	struct ns_common ns;
> >> > +	char core_pattern[CORENAME_MAX_SIZE];
> >> >  };
> >> >
> >> >  extern struct pid_namespace init_pid_ns;
> >> > diff --git a/kernel/pid.c b/kernel/pid.c
> >> > index 4d73a83..c79c1d5 100644
> >> > --- a/kernel/pid.c
> >> > +++ b/kernel/pid.c
> >> > @@ -83,6 +83,7 @@ struct pid_namespace init_pid_ns = {
> >> >  #ifdef CONFIG_PID_NS
> >> >  	.ns.ops = &pidns_operations,
> >> >  #endif
> >> > +	.core_pattern = "core",
> >> >  };
> >> >  EXPORT_SYMBOL_GPL(init_pid_ns);
> >> >
> >> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> >> > index a65ba13..16d6d21 100644
> >> > --- a/kernel/pid_namespace.c
> >> > +++ b/kernel/pid_namespace.c
> >> > @@ -123,6 +123,9 @@ static struct pid_namespace
> >> *create_pid_namespace(struct user_namespace *user_ns
> >> >  	for (i = 1; i < PIDMAP_ENTRIES; i++)
> >> >  		atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
> >> >
> >> > +	strncpy(ns->core_pattern, parent_pid_ns->core_pattern,
> >> > +		sizeof(ns->core_pattern));
> >> > +
> >>
> >> This is pretty horrible.  You are giving unprivileged processes the
> >> ability to run an already specified core dump helper in a pid namespace
> >> of their choosing.
> >>
> > Similar problem before patch.
> > In piped core_pattern setting, any panic process will trigger a
> > running of core_dump process.
> 
> That is not at all alike.
> 
> > Comparing to current code, current code maybe more horrible,
> > the guest can destroy host system, and after this patch, the guest
> > can only destroy itself.
> > (As the script in patch description)
> 
> Only if the host is configured to allow itself to be stomped on.  That
> is completely a host configured setting.  Yes the host can configure
> itself in a way that can cause problems.  But that is the hosts problem.
> 
> > Actually it is not so horrible, only the root user can modify code_pattern,
> > and normal user/process have no chance to do bad thing.
> 
> The argument that a bad design is not bad because only root can do X
> does not fly anymore, especially in the presence of containers.
> 
> >> That is not backwards compatible, and it is possible this can lead to
> >> privilege escalation by triciking a privileged dump process to do
> >> something silly because it is running in the wrong pid namespace.
> >>
> > In current code, the dump process is forking from kernel thread,
> > it is in a most-privileged namespace, dumping contents into host's fs,
> > it really cause problem.
> 
> It only dumps into the host's fs if it is configured that way.  When you
> point a gun at your foot and pull the trigger that really causes
> problems as well.  That doesn't make it the gun's problem that it can be
> pointed at feet.
> 
> > Compare to current code, running dump process in container's
> > namespace maybe the right way.
> >
> > The only thing this patch do is letting dump program running in
> > container's namespace instead of host.
> 
> The ability to trick a more privileged program to do the wrong thing is
> most definitely more than only letting the dump program run in the
> container's namespace.  This can be everything up to including getting
> root outside of the container.
> 
> I will admit that if used with a user namespace what you are doing is
> not the worst possible thing to do but it is very definitely a mess.
> 
> >> Similarly the entire concept of forking from the program dumping core
> >> suffers from the same problem but for all other namespaces.
> >>
> >> I was hoping that I would see a justification somewhere in the patch
> >> descriptions describing why this set of decisions could be safe.  I do
> >> not and so I assume this case was not considered.
> >>
> >> If you had managed to fork for the child_reaper of the pid_namespace
> >> that set the core pattern (as has been suggested) there would be some
> >> chance that things would work correctly.
> > Do you mean do fork in kthread(who is running in host's namespace, as
> corrent code)
> > with some special operation to change new thread running in container's
> > namespace?
> >
> >> As you are forking from the program actually dumping core I see no
> >> chance that this patchset is either safe or backwards compatible as
> >> currently written.
> >>
> > Current code have obvious problem, this forking new thread in container's
> > namespace is nothing but safe than host's namespace.
> > At least we need to solve the problem descripted in script in patch
> > description.
> 
> The current code has obvious limitations.  But you can in userspace
> accomplish mush everything you hope to accomplish here, as all of the
> information is available.  It just requires coopeartion.
> 
> > The only thing is backwards compatible, as our discussion in v1 patch,
> > it is the thing we need to change.
> 
> *Laughs*
> 
> This code is so absurd in handling the weird cases that I was hoping
> that someone else would point out how very very bad it is.  The possible
> solutions to the problems have already been discussed to ad nasium and
> you have not used any of those solutions.  Although it does seem the
> kbuild test robot was not too bad in point out how poor your testing of
> these patches was.
> 
> At the end of the day I can break any number of current setups with your
> patches.  Then there are the security implication of confusing
> privileged or somewhat privileged dumping programs.   On the one hand
> your patches are not giving the core dumping program enough privielges
> to write core dumps, on the other hand you are making it possible to
> confuse at least set uid root core dump helpers, leading to privilege
> escalations.
> 
> In no case does what happens during a core dump follow any version of
> the principle of least surprise.
> 
> 
> I will agree that the problem you are trying to solve is a pain point,
> but as I have said before it is a pain point because it is tricky to get
> all of the details right for an real solution.
> 
Let me make a summarize:
You think this way is not acceptable, because the pipe program is running
in the panic-process's namespace context.

And in my view, a pipe program in the host's top level namespace is also
a problem.

Let us think a container, to make it act as a real machine, when a program
panic, linux kernel should dump it into the container's filesystem.

For the kernel, to keep the current way of forking pipe program by kthread,
just let the pipe thread running in the container's namespace, instead the host,
may solve the problem in current kernel.

What is your opinion?

Btw, this patch is trying to solve the problem descripted in thread named:
"piping core dump to a program escapes container" in
http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.html
Maybe using a userspace tool can make container dump to anywhere,
but for kernel ifself, it is better to solve above problem if we can.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-21 10:09                   ` Zhao Lei
@ 2016-03-21 21:24                     ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21 21:24 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]

> Let me make a summarize:
> You think this way is not acceptable, because the pipe program is running
> in the panic-process's namespace context.

Actually my view is that your patchset is not acceptable because it
is implemented in a way that is not backwards compatible (AKA it can
break existing configurations that remain unchanged) and your
implementation does not appear in the least safe from malicious users.

There is also a problem that your patchset is simply buggy for what it
tries to implement, as using pid_ns_for_children and the multiple kbuild
robot emails testifies.

> And in my view, a pipe program in the host's top level namespace is also
> a problem.
>
> Let us think a container, to make it act as a real machine, when a program
> panic, linux kernel should dump it into the container's filesystem.
>
> For the kernel, to keep the current way of forking pipe program by kthread,
> just let the pipe thread running in the container's namespace, instead the host,
> may solve the problem in current kernel.
>
> What is your opinion?
>
> Btw, this patch is trying to solve the problem descripted in thread named:
> "piping core dump to a program escapes container" in
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.html
> Maybe using a userspace tool can make container dump to anywhere,
> but for kernel ifself, it is better to solve above problem if we can.

I think it would be great to find a way to run a core dump helper and
otherwise allow setting the core dump pattern in a container in a way
that is safe from malicious users and does not break existing setups.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-21 21:24                     ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-21 21:24 UTC (permalink / raw)
  To: Zhao Lei
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Zhao Lei <zhaolei@cn.fujitsu.com> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xmission.com]

> Let me make a summarize:
> You think this way is not acceptable, because the pipe program is running
> in the panic-process's namespace context.

Actually my view is that your patchset is not acceptable because it
is implemented in a way that is not backwards compatible (AKA it can
break existing configurations that remain unchanged) and your
implementation does not appear in the least safe from malicious users.

There is also a problem that your patchset is simply buggy for what it
tries to implement, as using pid_ns_for_children and the multiple kbuild
robot emails testifies.

> And in my view, a pipe program in the host's top level namespace is also
> a problem.
>
> Let us think a container, to make it act as a real machine, when a program
> panic, linux kernel should dump it into the container's filesystem.
>
> For the kernel, to keep the current way of forking pipe program by kthread,
> just let the pipe thread running in the container's namespace, instead the host,
> may solve the problem in current kernel.
>
> What is your opinion?
>
> Btw, this patch is trying to solve the problem descripted in thread named:
> "piping core dump to a program escapes container" in
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.html
> Maybe using a userspace tool can make container dump to anywhere,
> but for kernel ifself, it is better to solve above problem if we can.

I think it would be great to find a way to run a core dump helper and
otherwise allow setting the core dump pattern in a container in a way
that is safe from malicious users and does not break existing setups.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-21 21:24                     ` Eric W. Biederman
@ 2016-03-22  1:38                         ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-22  1:38 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> Sent: Tuesday, March 22, 2016 5:25 AM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> 
> > Hi, Eric
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> 
> > Let me make a summarize:
> > You think this way is not acceptable, because the pipe program is running
> > in the panic-process's namespace context.
> 
> Actually my view is that your patchset is not acceptable because it
> is implemented in a way that is not backwards compatible (AKA it can
> break existing configurations that remain unchanged) and your
> implementation does not appear in the least safe from malicious users.
> 
> There is also a problem that your patchset is simply buggy for what it
> tries to implement, as using pid_ns_for_children and the multiple kbuild
> robot emails testifies.
> 
> > And in my view, a pipe program in the host's top level namespace is also
> > a problem.
> >
> > Let us think a container, to make it act as a real machine, when a program
> > panic, linux kernel should dump it into the container's filesystem.
> >
> > For the kernel, to keep the current way of forking pipe program by kthread,
> > just let the pipe thread running in the container's namespace, instead the
> host,
> > may solve the problem in current kernel.
> >
> > What is your opinion?
> >
> > Btw, this patch is trying to solve the problem descripted in thread named:
> > "piping core dump to a program escapes container" in
> >
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
> html
> > Maybe using a userspace tool can make container dump to anywhere,
> > but for kernel ifself, it is better to solve above problem if we can.
> 
> I think it would be great to find a way to run a core dump helper and
> otherwise allow setting the core dump pattern in a container in a way
> that is safe from malicious users and does not break existing setups.
> 
So, there is following problem:
1: safe from malicious users
  We can try to find a way to fork process which have no relationship
  with the panic process.
2: Bug in patch
  It can be fixed, but I'd rather get a conclusion of this discussion
  before fix.
3: Backwards compatible
  It maybe the biggest problem in discussion, this patch is used to let
  container dump files into container, it is different with current action.
  Before patch:
    File type dump_pattern: dump to container
    Pipe type dump_pattern: dump to host
  After patch:
    File type dump_pattern: dump to container
    Pipe type dump_pattern: dump to container
  The second design seems better but not compatible with current kernel,
  but this patch can not fix to keep compatible because it is the patch's
  function.
  Maybe we can make some workagound, as:
  a. Add a kernel config to let the old style as default.
  b. keep old style, and add "||" for core_pattern, as
    echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
    to dump to container.

  What is your opinion about it?

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-22  1:38                         ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-22  1:38 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Tuesday, March 22, 2016 5:25 AM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu@jp.fujitsu.com>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> 
> > Hi, Eric
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> 
> > Let me make a summarize:
> > You think this way is not acceptable, because the pipe program is running
> > in the panic-process's namespace context.
> 
> Actually my view is that your patchset is not acceptable because it
> is implemented in a way that is not backwards compatible (AKA it can
> break existing configurations that remain unchanged) and your
> implementation does not appear in the least safe from malicious users.
> 
> There is also a problem that your patchset is simply buggy for what it
> tries to implement, as using pid_ns_for_children and the multiple kbuild
> robot emails testifies.
> 
> > And in my view, a pipe program in the host's top level namespace is also
> > a problem.
> >
> > Let us think a container, to make it act as a real machine, when a program
> > panic, linux kernel should dump it into the container's filesystem.
> >
> > For the kernel, to keep the current way of forking pipe program by kthread,
> > just let the pipe thread running in the container's namespace, instead the
> host,
> > may solve the problem in current kernel.
> >
> > What is your opinion?
> >
> > Btw, this patch is trying to solve the problem descripted in thread named:
> > "piping core dump to a program escapes container" in
> >
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
> html
> > Maybe using a userspace tool can make container dump to anywhere,
> > but for kernel ifself, it is better to solve above problem if we can.
> 
> I think it would be great to find a way to run a core dump helper and
> otherwise allow setting the core dump pattern in a container in a way
> that is safe from malicious users and does not break existing setups.
> 
So, there is following problem:
1: safe from malicious users
  We can try to find a way to fork process which have no relationship
  with the panic process.
2: Bug in patch
  It can be fixed, but I'd rather get a conclusion of this discussion
  before fix.
3: Backwards compatible
  It maybe the biggest problem in discussion, this patch is used to let
  container dump files into container, it is different with current action.
  Before patch:
    File type dump_pattern: dump to container
    Pipe type dump_pattern: dump to host
  After patch:
    File type dump_pattern: dump to container
    Pipe type dump_pattern: dump to container
  The second design seems better but not compatible with current kernel,
  but this patch can not fix to keep compatible because it is the patch's
  function.
  Maybe we can make some workagound, as:
  a. Add a kernel config to let the old style as default.
  b. keep old style, and add "||" for core_pattern, as
    echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
    to dump to container.

  What is your opinion about it?

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-22  1:38                         ` Zhao Lei
@ 2016-03-22 22:42                           ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-22 22:42 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> Sent: Tuesday, March 22, 2016 5:25 AM
>> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
>> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
>> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> 
>> > Hi, Eric
>> >
>> >> -----Original Message-----
>> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> 
>> > Let me make a summarize:
>> > You think this way is not acceptable, because the pipe program is running
>> > in the panic-process's namespace context.
>> 
>> Actually my view is that your patchset is not acceptable because it
>> is implemented in a way that is not backwards compatible (AKA it can
>> break existing configurations that remain unchanged) and your
>> implementation does not appear in the least safe from malicious users.
>> 
>> There is also a problem that your patchset is simply buggy for what it
>> tries to implement, as using pid_ns_for_children and the multiple kbuild
>> robot emails testifies.
>> 
>> > And in my view, a pipe program in the host's top level namespace is also
>> > a problem.
>> >
>> > Let us think a container, to make it act as a real machine, when a program
>> > panic, linux kernel should dump it into the container's filesystem.
>> >
>> > For the kernel, to keep the current way of forking pipe program by kthread,
>> > just let the pipe thread running in the container's namespace, instead the
>> host,
>> > may solve the problem in current kernel.
>> >
>> > What is your opinion?
>> >
>> > Btw, this patch is trying to solve the problem descripted in thread named:
>> > "piping core dump to a program escapes container" in
>> >
>> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
>> html
>> > Maybe using a userspace tool can make container dump to anywhere,
>> > but for kernel ifself, it is better to solve above problem if we can.
>> 
>> I think it would be great to find a way to run a core dump helper and
>> otherwise allow setting the core dump pattern in a container in a way
>> that is safe from malicious users and does not break existing setups.
>> 
> So, there is following problem:
> 1: safe from malicious users
>   We can try to find a way to fork process which have no relationship
>   with the panic process.
> 2: Bug in patch
>   It can be fixed, but I'd rather get a conclusion of this discussion
>   before fix.
> 3: Backwards compatible
>   It maybe the biggest problem in discussion, this patch is used to let
>   container dump files into container, it is different with current action.
>   Before patch:
>     File type dump_pattern: dump to container
>     Pipe type dump_pattern: dump to host
>   After patch:
>     File type dump_pattern: dump to container
>     Pipe type dump_pattern: dump to container
>   The second design seems better but not compatible with current kernel,
>   but this patch can not fix to keep compatible because it is the patch's
>   function.
>   Maybe we can make some workagound, as:
>   a. Add a kernel config to let the old style as default.
>   b. keep old style, and add "||" for core_pattern, as
>     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
>     to dump to container.
>
>   What is your opinion about it?

There are two parts to backwards compatibility.

1) How should core patterns that are set outside of any container be
   treated?

   The patchset under discussion imported core patterns set outside of
   containers into containers completely trivially changing their
   behavior and breaking backwards compatibility.  That is just not
   acceptable.

2) How should core patterns inside of containers be treated?

   There are corner cases that I am not thinking of that will cause
   regressions but I think we can reasonably assume that core patterns
   are not set inside of containers today.  Assuming that is true,
   then setting a core pattern inside of a container is the only thing
   that the kernel needs to detect to handle working inside of a
   container.

   The practical question I see for this case is which parent process
   needs to be used for your core dump helper, and which set of
   namespaces that parent helper has.

   I will also remind people thinking about these issues that containers
   can be run recursisvely.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-22 22:42                           ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-22 22:42 UTC (permalink / raw)
  To: Zhao Lei
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Zhao Lei <zhaolei@cn.fujitsu.com> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> Sent: Tuesday, March 22, 2016 5:25 AM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
>> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
>> <kamezawa.hiroyu@jp.fujitsu.com>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
>> 
>> > Hi, Eric
>> >
>> >> -----Original Message-----
>> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> 
>> > Let me make a summarize:
>> > You think this way is not acceptable, because the pipe program is running
>> > in the panic-process's namespace context.
>> 
>> Actually my view is that your patchset is not acceptable because it
>> is implemented in a way that is not backwards compatible (AKA it can
>> break existing configurations that remain unchanged) and your
>> implementation does not appear in the least safe from malicious users.
>> 
>> There is also a problem that your patchset is simply buggy for what it
>> tries to implement, as using pid_ns_for_children and the multiple kbuild
>> robot emails testifies.
>> 
>> > And in my view, a pipe program in the host's top level namespace is also
>> > a problem.
>> >
>> > Let us think a container, to make it act as a real machine, when a program
>> > panic, linux kernel should dump it into the container's filesystem.
>> >
>> > For the kernel, to keep the current way of forking pipe program by kthread,
>> > just let the pipe thread running in the container's namespace, instead the
>> host,
>> > may solve the problem in current kernel.
>> >
>> > What is your opinion?
>> >
>> > Btw, this patch is trying to solve the problem descripted in thread named:
>> > "piping core dump to a program escapes container" in
>> >
>> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
>> html
>> > Maybe using a userspace tool can make container dump to anywhere,
>> > but for kernel ifself, it is better to solve above problem if we can.
>> 
>> I think it would be great to find a way to run a core dump helper and
>> otherwise allow setting the core dump pattern in a container in a way
>> that is safe from malicious users and does not break existing setups.
>> 
> So, there is following problem:
> 1: safe from malicious users
>   We can try to find a way to fork process which have no relationship
>   with the panic process.
> 2: Bug in patch
>   It can be fixed, but I'd rather get a conclusion of this discussion
>   before fix.
> 3: Backwards compatible
>   It maybe the biggest problem in discussion, this patch is used to let
>   container dump files into container, it is different with current action.
>   Before patch:
>     File type dump_pattern: dump to container
>     Pipe type dump_pattern: dump to host
>   After patch:
>     File type dump_pattern: dump to container
>     Pipe type dump_pattern: dump to container
>   The second design seems better but not compatible with current kernel,
>   but this patch can not fix to keep compatible because it is the patch's
>   function.
>   Maybe we can make some workagound, as:
>   a. Add a kernel config to let the old style as default.
>   b. keep old style, and add "||" for core_pattern, as
>     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
>     to dump to container.
>
>   What is your opinion about it?

There are two parts to backwards compatibility.

1) How should core patterns that are set outside of any container be
   treated?

   The patchset under discussion imported core patterns set outside of
   containers into containers completely trivially changing their
   behavior and breaking backwards compatibility.  That is just not
   acceptable.

2) How should core patterns inside of containers be treated?

   There are corner cases that I am not thinking of that will cause
   regressions but I think we can reasonably assume that core patterns
   are not set inside of containers today.  Assuming that is true,
   then setting a core pattern inside of a container is the only thing
   that the kernel needs to detect to handle working inside of a
   container.

   The practical question I see for this case is which parent process
   needs to be used for your core dump helper, and which set of
   namespaces that parent helper has.

   I will also remind people thinking about these issues that containers
   can be run recursisvely.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-22 22:42                           ` Eric W. Biederman
@ 2016-03-23 11:58                               ` Zhao Lei
  -1 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-23 11:58 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> Sent: Wednesday, March 23, 2016 6:43 AM
> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> 
> > Hi, Eric
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> >> Sent: Tuesday, March 22, 2016 5:25 AM
> >> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> >> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> >> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
> >> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> >>
> >> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> >>
> >> > Hi, Eric
> >> >
> >> >> -----Original Message-----
> >> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
> >>
> >> > Let me make a summarize:
> >> > You think this way is not acceptable, because the pipe program is running
> >> > in the panic-process's namespace context.
> >>
> >> Actually my view is that your patchset is not acceptable because it
> >> is implemented in a way that is not backwards compatible (AKA it can
> >> break existing configurations that remain unchanged) and your
> >> implementation does not appear in the least safe from malicious users.
> >>
> >> There is also a problem that your patchset is simply buggy for what it
> >> tries to implement, as using pid_ns_for_children and the multiple kbuild
> >> robot emails testifies.
> >>
> >> > And in my view, a pipe program in the host's top level namespace is also
> >> > a problem.
> >> >
> >> > Let us think a container, to make it act as a real machine, when a program
> >> > panic, linux kernel should dump it into the container's filesystem.
> >> >
> >> > For the kernel, to keep the current way of forking pipe program by kthread,
> >> > just let the pipe thread running in the container's namespace, instead the
> >> host,
> >> > may solve the problem in current kernel.
> >> >
> >> > What is your opinion?
> >> >
> >> > Btw, this patch is trying to solve the problem descripted in thread named:
> >> > "piping core dump to a program escapes container" in
> >> >
> >>
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
> >> html
> >> > Maybe using a userspace tool can make container dump to anywhere,
> >> > but for kernel ifself, it is better to solve above problem if we can.
> >>
> >> I think it would be great to find a way to run a core dump helper and
> >> otherwise allow setting the core dump pattern in a container in a way
> >> that is safe from malicious users and does not break existing setups.
> >>
> > So, there is following problem:
> > 1: safe from malicious users
> >   We can try to find a way to fork process which have no relationship
> >   with the panic process.
> > 2: Bug in patch
> >   It can be fixed, but I'd rather get a conclusion of this discussion
> >   before fix.
> > 3: Backwards compatible
> >   It maybe the biggest problem in discussion, this patch is used to let
> >   container dump files into container, it is different with current action.
> >   Before patch:
> >     File type dump_pattern: dump to container
> >     Pipe type dump_pattern: dump to host
> >   After patch:
> >     File type dump_pattern: dump to container
> >     Pipe type dump_pattern: dump to container
> >   The second design seems better but not compatible with current kernel,
> >   but this patch can not fix to keep compatible because it is the patch's
> >   function.
> >   Maybe we can make some workagound, as:
> >   a. Add a kernel config to let the old style as default.
> >   b. keep old style, and add "||" for core_pattern, as
> >     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
> >     to dump to container.
> >
> >   What is your opinion about it?
> 

Thanks for detailed answer.

> There are two parts to backwards compatibility.
> 
> 1) How should core patterns that are set outside of any container be
>    treated?
> 
>    The patchset under discussion imported core patterns set outside of
>    containers into containers completely trivially changing their
>    behavior and breaking backwards compatibility.  That is just not
>    acceptable.
> 
Before this patch, setting pattern outside container will change setting
in container.
And after this patch, host and container are separated, they have respective
setting, and no relationship except the container will copy host's setting
in create.

If we don't think compatibility, it may be a good idea, the container is only
allowed to change the core_pattern by itself.
It is strange that the core_pattern in container was changed but user/process
in container do nothing.

But just as you said, it have compatibility problem, someone maybe using
this feature to modify core_pattern in container in host.

To keep the compatibility and allow custom core_pattern in continer,
maybe we can:
1: If no process setting core_pattern in container, the container's
  core_pattern keep same copy with host.
  And if core_pattern in host changed this time, the container's pattern
  changed with host.
2: If core_pattern in container changed by some one/process in container,
  it is separated with host, and modification in host will not affect container.

What is your opinion about it?

> 2) How should core patterns inside of containers be treated?
> 
>    There are corner cases that I am not thinking of that will cause
>    regressions but I think we can reasonably assume that core patterns
>    are not set inside of containers today.  Assuming that is true,
>    then setting a core pattern inside of a container is the only thing
>    that the kernel needs to detect to handle working inside of a
>    container.
> 
>    The practical question I see for this case is which parent process
>    needs to be used for your core dump helper, and which set of
>    namespaces that parent helper has.
> 
>    I will also remind people thinking about these issues that containers
>    can be run recursisvely.
> 
Got it.
I'll try to find a way.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-23 11:58                               ` Zhao Lei
  0 siblings, 0 replies; 28+ messages in thread
From: Zhao Lei @ 2016-03-23 11:58 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Hi, Eric

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Wednesday, March 23, 2016 6:43 AM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
> <kamezawa.hiroyu@jp.fujitsu.com>
> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> 
> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> 
> > Hi, Eric
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> >> Sent: Tuesday, March 22, 2016 5:25 AM
> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
> >> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
> >> <kamezawa.hiroyu@jp.fujitsu.com>
> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
> >>
> >> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
> >>
> >> > Hi, Eric
> >> >
> >> >> -----Original Message-----
> >> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> >>
> >> > Let me make a summarize:
> >> > You think this way is not acceptable, because the pipe program is running
> >> > in the panic-process's namespace context.
> >>
> >> Actually my view is that your patchset is not acceptable because it
> >> is implemented in a way that is not backwards compatible (AKA it can
> >> break existing configurations that remain unchanged) and your
> >> implementation does not appear in the least safe from malicious users.
> >>
> >> There is also a problem that your patchset is simply buggy for what it
> >> tries to implement, as using pid_ns_for_children and the multiple kbuild
> >> robot emails testifies.
> >>
> >> > And in my view, a pipe program in the host's top level namespace is also
> >> > a problem.
> >> >
> >> > Let us think a container, to make it act as a real machine, when a program
> >> > panic, linux kernel should dump it into the container's filesystem.
> >> >
> >> > For the kernel, to keep the current way of forking pipe program by kthread,
> >> > just let the pipe thread running in the container's namespace, instead the
> >> host,
> >> > may solve the problem in current kernel.
> >> >
> >> > What is your opinion?
> >> >
> >> > Btw, this patch is trying to solve the problem descripted in thread named:
> >> > "piping core dump to a program escapes container" in
> >> >
> >>
> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
> >> html
> >> > Maybe using a userspace tool can make container dump to anywhere,
> >> > but for kernel ifself, it is better to solve above problem if we can.
> >>
> >> I think it would be great to find a way to run a core dump helper and
> >> otherwise allow setting the core dump pattern in a container in a way
> >> that is safe from malicious users and does not break existing setups.
> >>
> > So, there is following problem:
> > 1: safe from malicious users
> >   We can try to find a way to fork process which have no relationship
> >   with the panic process.
> > 2: Bug in patch
> >   It can be fixed, but I'd rather get a conclusion of this discussion
> >   before fix.
> > 3: Backwards compatible
> >   It maybe the biggest problem in discussion, this patch is used to let
> >   container dump files into container, it is different with current action.
> >   Before patch:
> >     File type dump_pattern: dump to container
> >     Pipe type dump_pattern: dump to host
> >   After patch:
> >     File type dump_pattern: dump to container
> >     Pipe type dump_pattern: dump to container
> >   The second design seems better but not compatible with current kernel,
> >   but this patch can not fix to keep compatible because it is the patch's
> >   function.
> >   Maybe we can make some workagound, as:
> >   a. Add a kernel config to let the old style as default.
> >   b. keep old style, and add "||" for core_pattern, as
> >     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
> >     to dump to container.
> >
> >   What is your opinion about it?
> 

Thanks for detailed answer.

> There are two parts to backwards compatibility.
> 
> 1) How should core patterns that are set outside of any container be
>    treated?
> 
>    The patchset under discussion imported core patterns set outside of
>    containers into containers completely trivially changing their
>    behavior and breaking backwards compatibility.  That is just not
>    acceptable.
> 
Before this patch, setting pattern outside container will change setting
in container.
And after this patch, host and container are separated, they have respective
setting, and no relationship except the container will copy host's setting
in create.

If we don't think compatibility, it may be a good idea, the container is only
allowed to change the core_pattern by itself.
It is strange that the core_pattern in container was changed but user/process
in container do nothing.

But just as you said, it have compatibility problem, someone maybe using
this feature to modify core_pattern in container in host.

To keep the compatibility and allow custom core_pattern in continer,
maybe we can:
1: If no process setting core_pattern in container, the container's
  core_pattern keep same copy with host.
  And if core_pattern in host changed this time, the container's pattern
  changed with host.
2: If core_pattern in container changed by some one/process in container,
  it is separated with host, and modification in host will not affect container.

What is your opinion about it?

> 2) How should core patterns inside of containers be treated?
> 
>    There are corner cases that I am not thinking of that will cause
>    regressions but I think we can reasonably assume that core patterns
>    are not set inside of containers today.  Assuming that is true,
>    then setting a core pattern inside of a container is the only thing
>    that the kernel needs to detect to handle working inside of a
>    container.
> 
>    The practical question I see for this case is which parent process
>    needs to be used for your core dump helper, and which set of
>    namespaces that parent helper has.
> 
>    I will also remind people thinking about these issues that containers
>    can be run recursisvely.
> 
Got it.
I'll try to find a way.

Thanks
Zhaolei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
  2016-03-23 11:58                               ` Zhao Lei
@ 2016-03-23 19:54                                 ` Eric W. Biederman
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-23 19:54 UTC (permalink / raw)
  To: Zhao Lei
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> Sent: Wednesday, March 23, 2016 6:43 AM
>> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
>> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
>> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> 
>> > Hi, Eric
>> >
>> >> -----Original Message-----
>> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> >> Sent: Tuesday, March 22, 2016 5:25 AM
>> >> To: Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> >> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
>> >> 'Mateusz Guzik' <mguzik-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; 'Kamezawa Hiroyuki'
>> >> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> >>
>> >> Zhao Lei <zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
>> >>
>> >> > Hi, Eric
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Eric W. Biederman [mailto:ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org]
>> >>
>> >> > Let me make a summarize:
>> >> > You think this way is not acceptable, because the pipe program is running
>> >> > in the panic-process's namespace context.
>> >>
>> >> Actually my view is that your patchset is not acceptable because it
>> >> is implemented in a way that is not backwards compatible (AKA it can
>> >> break existing configurations that remain unchanged) and your
>> >> implementation does not appear in the least safe from malicious users.
>> >>
>> >> There is also a problem that your patchset is simply buggy for what it
>> >> tries to implement, as using pid_ns_for_children and the multiple kbuild
>> >> robot emails testifies.
>> >>
>> >> > And in my view, a pipe program in the host's top level namespace is also
>> >> > a problem.
>> >> >
>> >> > Let us think a container, to make it act as a real machine, when a program
>> >> > panic, linux kernel should dump it into the container's filesystem.
>> >> >
>> >> > For the kernel, to keep the current way of forking pipe program by kthread,
>> >> > just let the pipe thread running in the container's namespace, instead the
>> >> host,
>> >> > may solve the problem in current kernel.
>> >> >
>> >> > What is your opinion?
>> >> >
>> >> > Btw, this patch is trying to solve the problem descripted in thread named:
>> >> > "piping core dump to a program escapes container" in
>> >> >
>> >>
>> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
>> >> html
>> >> > Maybe using a userspace tool can make container dump to anywhere,
>> >> > but for kernel ifself, it is better to solve above problem if we can.
>> >>
>> >> I think it would be great to find a way to run a core dump helper and
>> >> otherwise allow setting the core dump pattern in a container in a way
>> >> that is safe from malicious users and does not break existing setups.
>> >>
>> > So, there is following problem:
>> > 1: safe from malicious users
>> >   We can try to find a way to fork process which have no relationship
>> >   with the panic process.
>> > 2: Bug in patch
>> >   It can be fixed, but I'd rather get a conclusion of this discussion
>> >   before fix.
>> > 3: Backwards compatible
>> >   It maybe the biggest problem in discussion, this patch is used to let
>> >   container dump files into container, it is different with current action.
>> >   Before patch:
>> >     File type dump_pattern: dump to container
>> >     Pipe type dump_pattern: dump to host
>> >   After patch:
>> >     File type dump_pattern: dump to container
>> >     Pipe type dump_pattern: dump to container
>> >   The second design seems better but not compatible with current kernel,
>> >   but this patch can not fix to keep compatible because it is the patch's
>> >   function.
>> >   Maybe we can make some workagound, as:
>> >   a. Add a kernel config to let the old style as default.
>> >   b. keep old style, and add "||" for core_pattern, as
>> >     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
>> >     to dump to container.
>> >
>> >   What is your opinion about it?
>> 
>
> Thanks for detailed answer.
>
>> There are two parts to backwards compatibility.
>> 
>> 1) How should core patterns that are set outside of any container be
>>    treated?
>> 
>>    The patchset under discussion imported core patterns set outside of
>>    containers into containers completely trivially changing their
>>    behavior and breaking backwards compatibility.  That is just not
>>    acceptable.
>> 
> Before this patch, setting pattern outside container will change setting
> in container.
> And after this patch, host and container are separated, they have respective
> setting, and no relationship except the container will copy host's setting
> in create.
>
> If we don't think compatibility, it may be a good idea, the container is only
> allowed to change the core_pattern by itself.
> It is strange that the core_pattern in container was changed but user/process
> in container do nothing.
>
> But just as you said, it have compatibility problem, someone maybe using
> this feature to modify core_pattern in container in host.
>
> To keep the compatibility and allow custom core_pattern in continer,
> maybe we can:
> 1: If no process setting core_pattern in container, the container's
>   core_pattern keep same copy with host.
>   And if core_pattern in host changed this time, the container's pattern
>   changed with host.
> 2: If core_pattern in container changed by some one/process in container,
>   it is separated with host, and modification in host will not affect container.
>
> What is your opinion about it?

That sounds correct.  Use the core_pattern for the container if it
exists otherwise use a core_pattern for an outer container/host.

>> 2) How should core patterns inside of containers be treated?
>> 
>>    There are corner cases that I am not thinking of that will cause
>>    regressions but I think we can reasonably assume that core patterns
>>    are not set inside of containers today.  Assuming that is true,
>>    then setting a core pattern inside of a container is the only thing
>>    that the kernel needs to detect to handle working inside of a
>>    container.
>> 
>>    The practical question I see for this case is which parent process
>>    needs to be used for your core dump helper, and which set of
>>    namespaces that parent helper has.
>> 
>>    I will also remind people thinking about these issues that containers
>>    can be run recursisvely.
>> 
> Got it.
> I'll try to find a way.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/3] Make core_pattern support namespace
@ 2016-03-23 19:54                                 ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2016-03-23 19:54 UTC (permalink / raw)
  To: Zhao Lei
  Cc: linux-kernel, containers, 'Mateusz Guzik',
	'Kamezawa Hiroyuki'

Zhao Lei <zhaolei@cn.fujitsu.com> writes:

> Hi, Eric
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> Sent: Wednesday, March 23, 2016 6:43 AM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
>> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
>> <kamezawa.hiroyu@jp.fujitsu.com>
>> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> 
>> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
>> 
>> > Hi, Eric
>> >
>> >> -----Original Message-----
>> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> >> Sent: Tuesday, March 22, 2016 5:25 AM
>> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> Cc: linux-kernel@vger.kernel.org; containers@lists.linux-foundation.org;
>> >> 'Mateusz Guzik' <mguzik@redhat.com>; 'Kamezawa Hiroyuki'
>> >> <kamezawa.hiroyu@jp.fujitsu.com>
>> >> Subject: Re: [PATCH v2 3/3] Make core_pattern support namespace
>> >>
>> >> Zhao Lei <zhaolei@cn.fujitsu.com> writes:
>> >>
>> >> > Hi, Eric
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> >>
>> >> > Let me make a summarize:
>> >> > You think this way is not acceptable, because the pipe program is running
>> >> > in the panic-process's namespace context.
>> >>
>> >> Actually my view is that your patchset is not acceptable because it
>> >> is implemented in a way that is not backwards compatible (AKA it can
>> >> break existing configurations that remain unchanged) and your
>> >> implementation does not appear in the least safe from malicious users.
>> >>
>> >> There is also a problem that your patchset is simply buggy for what it
>> >> tries to implement, as using pid_ns_for_children and the multiple kbuild
>> >> robot emails testifies.
>> >>
>> >> > And in my view, a pipe program in the host's top level namespace is also
>> >> > a problem.
>> >> >
>> >> > Let us think a container, to make it act as a real machine, when a program
>> >> > panic, linux kernel should dump it into the container's filesystem.
>> >> >
>> >> > For the kernel, to keep the current way of forking pipe program by kthread,
>> >> > just let the pipe thread running in the container's namespace, instead the
>> >> host,
>> >> > may solve the problem in current kernel.
>> >> >
>> >> > What is your opinion?
>> >> >
>> >> > Btw, this patch is trying to solve the problem descripted in thread named:
>> >> > "piping core dump to a program escapes container" in
>> >> >
>> >>
>> http://lists.linuxfoundation.org/pipermail/containers/2015-December/036476.
>> >> html
>> >> > Maybe using a userspace tool can make container dump to anywhere,
>> >> > but for kernel ifself, it is better to solve above problem if we can.
>> >>
>> >> I think it would be great to find a way to run a core dump helper and
>> >> otherwise allow setting the core dump pattern in a container in a way
>> >> that is safe from malicious users and does not break existing setups.
>> >>
>> > So, there is following problem:
>> > 1: safe from malicious users
>> >   We can try to find a way to fork process which have no relationship
>> >   with the panic process.
>> > 2: Bug in patch
>> >   It can be fixed, but I'd rather get a conclusion of this discussion
>> >   before fix.
>> > 3: Backwards compatible
>> >   It maybe the biggest problem in discussion, this patch is used to let
>> >   container dump files into container, it is different with current action.
>> >   Before patch:
>> >     File type dump_pattern: dump to container
>> >     Pipe type dump_pattern: dump to host
>> >   After patch:
>> >     File type dump_pattern: dump to container
>> >     Pipe type dump_pattern: dump to container
>> >   The second design seems better but not compatible with current kernel,
>> >   but this patch can not fix to keep compatible because it is the patch's
>> >   function.
>> >   Maybe we can make some workagound, as:
>> >   a. Add a kernel config to let the old style as default.
>> >   b. keep old style, and add "||" for core_pattern, as
>> >     echo "|| /root/container_dumper" >/proc/sys/kernel/core_pattern
>> >     to dump to container.
>> >
>> >   What is your opinion about it?
>> 
>
> Thanks for detailed answer.
>
>> There are two parts to backwards compatibility.
>> 
>> 1) How should core patterns that are set outside of any container be
>>    treated?
>> 
>>    The patchset under discussion imported core patterns set outside of
>>    containers into containers completely trivially changing their
>>    behavior and breaking backwards compatibility.  That is just not
>>    acceptable.
>> 
> Before this patch, setting pattern outside container will change setting
> in container.
> And after this patch, host and container are separated, they have respective
> setting, and no relationship except the container will copy host's setting
> in create.
>
> If we don't think compatibility, it may be a good idea, the container is only
> allowed to change the core_pattern by itself.
> It is strange that the core_pattern in container was changed but user/process
> in container do nothing.
>
> But just as you said, it have compatibility problem, someone maybe using
> this feature to modify core_pattern in container in host.
>
> To keep the compatibility and allow custom core_pattern in continer,
> maybe we can:
> 1: If no process setting core_pattern in container, the container's
>   core_pattern keep same copy with host.
>   And if core_pattern in host changed this time, the container's pattern
>   changed with host.
> 2: If core_pattern in container changed by some one/process in container,
>   it is separated with host, and modification in host will not affect container.
>
> What is your opinion about it?

That sounds correct.  Use the core_pattern for the container if it
exists otherwise use a core_pattern for an outer container/host.

>> 2) How should core patterns inside of containers be treated?
>> 
>>    There are corner cases that I am not thinking of that will cause
>>    regressions but I think we can reasonably assume that core patterns
>>    are not set inside of containers today.  Assuming that is true,
>>    then setting a core pattern inside of a container is the only thing
>>    that the kernel needs to detect to handle working inside of a
>>    container.
>> 
>>    The practical question I see for this case is which parent process
>>    needs to be used for your core dump helper, and which set of
>>    namespaces that parent helper has.
>> 
>>    I will also remind people thinking about these issues that containers
>>    can be run recursisvely.
>> 
> Got it.
> I'll try to find a way.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-03-23 20:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-18 12:48 [PATCH v2 0/3] Make core_pattern support namespace Zhao Lei
2016-03-18 12:48 ` Zhao Lei
     [not found] ` <cover.1458305141.git.zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2016-03-18 12:48   ` [PATCH v2 1/3] Make _do_fork support return to caller's code Zhao Lei
2016-03-18 12:48     ` Zhao Lei
2016-03-18 12:48   ` [PATCH v2 2/3] Run dump pipe in container's namespace Zhao Lei
2016-03-18 12:48     ` Zhao Lei
     [not found]     ` <a1ac975a165b63c8ee67f9785e50067be41a74ff.1458305141.git.zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2016-03-18 14:06       ` kbuild test robot
2016-03-18 14:06         ` kbuild test robot
2016-03-18 12:48   ` [PATCH v2 3/3] Make core_pattern support namespace Zhao Lei
2016-03-18 12:48     ` Zhao Lei
     [not found]     ` <77053bb2bdd21489e09b6ef362044d283e1ba12b.1458305141.git.zhaolei-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2016-03-21  6:00       ` Eric W. Biederman
2016-03-21  6:00         ` Eric W. Biederman
     [not found]         ` <87twk0tlok.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-03-21  7:16           ` Zhao Lei
2016-03-21  7:16             ` Zhao Lei
2016-03-21  8:14             ` Eric W. Biederman
2016-03-21  8:14               ` Eric W. Biederman
     [not found]               ` <87shzkqmc8.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-03-21 10:09                 ` Zhao Lei
2016-03-21 10:09                   ` Zhao Lei
2016-03-21 21:24                   ` Eric W. Biederman
2016-03-21 21:24                     ` Eric W. Biederman
     [not found]                     ` <878u1bo772.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-03-22  1:38                       ` Zhao Lei
2016-03-22  1:38                         ` Zhao Lei
2016-03-22 22:42                         ` Eric W. Biederman
2016-03-22 22:42                           ` Eric W. Biederman
     [not found]                           ` <87lh5am8wu.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-03-23 11:58                             ` Zhao Lei
2016-03-23 11:58                               ` Zhao Lei
2016-03-23 19:54                               ` Eric W. Biederman
2016-03-23 19:54                                 ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.