linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6] per signal_struct coredumps
@ 2021-09-24  0:08 Eric W. Biederman
  2021-09-24  0:09 ` [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop Eric W. Biederman
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Currently coredumps are mixed up with the exit code, the signal
handling code, and the ptrace code in ways that make them much more
complicated than necessary and difficult to follow.

This series of changes starts with ptrace_stop and cleans it up,
making it easier to follow what is happening in ptrace_stop.
Then it cleans up the exec interactions with coredumps.
Then it cleans up the coredump interactions with exit.
Then the coredump interactions with the signal handling code are
cleaned up.

The first and last changes are bug fixes for minor bugs.

I believe the fact that vfork followed by execve can kill the process
that called vfork if exec fails is sufficient justification to change
the userspace visible behavior.

In previous conversations it was suggested that some of these cleanups
did not stand on their own.  I think I have managed to organize things
so that all of the patches stand on their own.

This means that if for some reason the last change needs to be reverted,
we can still keep the gains from the other changes.


Eric W. Biederman (6):
      signal: Remove the bogus sigkill_pending in ptrace_stop
      ptrace: Remove the unnecessary arguments from arch_ptrace_stop
      exec: Check for a pending fatal signal instead of core_state
      exit: Factor coredump_exit_mm out of exit_mm
      coredump:  Don't perform any cleanups before dumping core
      coredump: Limit coredumps to a single thread group

 arch/ia64/include/asm/ptrace.h  |  4 +-
 arch/sparc/include/asm/ptrace.h |  8 ++--
 fs/binfmt_elf.c                 |  4 +-
 fs/binfmt_elf_fdpic.c           |  2 +-
 fs/coredump.c                   | 88 ++++++-----------------------------------
 fs/exec.c                       | 14 +++----
 fs/proc/array.c                 |  6 +--
 include/linux/mm_types.h        | 13 ------
 include/linux/ptrace.h          | 22 +++++------
 include/linux/sched.h           |  1 +
 include/linux/sched/signal.h    | 13 ++++++
 kernel/exit.c                   | 76 +++++++++++++++++++----------------
 kernel/fork.c                   |  4 +-
 kernel/signal.c                 | 49 ++++-------------------
 mm/oom_kill.c                   |  6 +--
 15 files changed, 104 insertions(+), 206 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
@ 2021-09-24  0:09 ` Eric W. Biederman
  2021-09-24 15:22   ` Kees Cook
  2021-09-24  0:10 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


The existence of sigkill_pending is a little silly as it is
functionally a duplicate of fatal_signal_pending that is used in
exactly one place.

Checking for pending fatal signals and returning early in ptrace_stop
is actively harmful.  It causes the ptrace_stop called by
ptrace_signal to return early before setting current->exit_code.
Later, when ptrace_signal reads the signal number from
current->exit_code, the value is undefined, making it unpredictable
what will happen.

Instead rely on the fact that schedule will not sleep if there is a
pending signal that can awaken a task.

Removing the explicit sigkill_pending test fixes ptrace_signal
when ptrace_stop does not stop, because current->exit_code is always
set to signr.
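For context, a simplified sketch (from my reading of
include/linux/sched/signal.h of this era; not part of the patch) of the
generic helper that makes sigkill_pending redundant:

```c
/*
 * Simplified sketch, for context only.  A group-fatal SIGKILL is made
 * pending on every task individually (e.g. complete_signal() and
 * zap_process() do sigaddset(&t->pending.signal, SIGKILL) plus
 * signal_wake_up()), which is why the per-task check suffices and the
 * extra shared_pending check in sigkill_pending() added nothing.
 */
static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
```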

Cc: stable@vger.kernel.org
Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/signal.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 952741f6d0f9..9f2dc9cf3208 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2182,15 +2182,6 @@ static inline bool may_ptrace_stop(void)
 	return true;
 }
 
-/*
- * Return non-zero if there is a SIGKILL that should be waking us up.
- * Called with the siglock held.
- */
-static bool sigkill_pending(struct task_struct *tsk)
-{
-	return sigismember(&tsk->pending.signal, SIGKILL) ||
-	       sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
-}
 
 /*
  * This must be called with current->sighand->siglock held.
@@ -2217,17 +2208,16 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		 * calling arch_ptrace_stop, so we must release it now.
 		 * To preserve proper semantics, we must do this before
 		 * any signal bookkeeping like checking group_stop_count.
-		 * Meanwhile, a SIGKILL could come in before we retake the
-		 * siglock.  That must prevent us from sleeping in TASK_TRACED.
-		 * So after regaining the lock, we must check for SIGKILL.
 		 */
 		spin_unlock_irq(&current->sighand->siglock);
 		arch_ptrace_stop(exit_code, info);
 		spin_lock_irq(&current->sighand->siglock);
-		if (sigkill_pending(current))
-			return;
 	}
 
+	/*
+	 * schedule() will not sleep if there is a pending signal that
+	 * can awaken the task.
+	 */
 	set_special_state(TASK_TRACED);
 
 	/*
-- 
2.20.1



* [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
  2021-09-24  0:09 ` [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop Eric W. Biederman
@ 2021-09-24  0:10 ` Eric W. Biederman
  2021-09-24 15:26   ` Kees Cook
  2021-09-24  0:10 ` [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state Eric W. Biederman
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
exit_code and a siginfo structure.  Neither argument is used by any of
the implementations so just remove the unneeded arguments.

The two architectures that implement arch_ptrace_stop are ia64 and
sparc.  Both architectures flush their register stacks before a
ptrace stop so that all of the register information can be accessed
by debuggers.

As the question of whether a register stack needs to be flushed is
independent of why ptrace is stopping, not needing arguments makes sense.

Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/ia64/include/asm/ptrace.h  |  4 ++--
 arch/sparc/include/asm/ptrace.h |  8 ++++----
 include/linux/ptrace.h          | 22 +++++++++-------------
 kernel/signal.c                 |  4 ++--
 4 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/include/asm/ptrace.h b/arch/ia64/include/asm/ptrace.h
index 08179135905c..f15504f75f10 100644
--- a/arch/ia64/include/asm/ptrace.h
+++ b/arch/ia64/include/asm/ptrace.h
@@ -129,9 +129,9 @@ static inline long regs_return_value(struct pt_regs *regs)
   extern void ia64_decrement_ip (struct pt_regs *pt);
 
   extern void ia64_ptrace_stop(void);
-  #define arch_ptrace_stop(code, info) \
+  #define arch_ptrace_stop() \
 	ia64_ptrace_stop()
-  #define arch_ptrace_stop_needed(code, info) \
+  #define arch_ptrace_stop_needed() \
 	(!test_thread_flag(TIF_RESTORE_RSE))
 
   extern void ptrace_attach_sync_user_rbs (struct task_struct *);
diff --git a/arch/sparc/include/asm/ptrace.h b/arch/sparc/include/asm/ptrace.h
index 71dd82b43cc5..d1419e669027 100644
--- a/arch/sparc/include/asm/ptrace.h
+++ b/arch/sparc/include/asm/ptrace.h
@@ -26,12 +26,12 @@ static inline bool pt_regs_clear_syscall(struct pt_regs *regs)
 	return (regs->tstate &= ~TSTATE_SYSCALL);
 }
 
-#define arch_ptrace_stop_needed(exit_code, info) \
+#define arch_ptrace_stop_needed() \
 ({	flush_user_windows(); \
 	get_thread_wsaved() != 0; \
 })
 
-#define arch_ptrace_stop(exit_code, info) \
+#define arch_ptrace_stop() \
 	synchronize_user_stack()
 
 #define current_pt_regs() \
@@ -129,12 +129,12 @@ static inline bool pt_regs_clear_syscall(struct pt_regs *regs)
 	return (regs->psr &= ~PSR_SYSCALL);
 }
 
-#define arch_ptrace_stop_needed(exit_code, info) \
+#define arch_ptrace_stop_needed() \
 ({	flush_user_windows(); \
 	current_thread_info()->w_saved != 0;	\
 })
 
-#define arch_ptrace_stop(exit_code, info) \
+#define arch_ptrace_stop() \
 	synchronize_user_stack()
 
 #define current_pt_regs() \
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index b5ebf6c01292..8aee2945ff08 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -362,29 +362,25 @@ static inline void user_single_step_report(struct pt_regs *regs)
 #ifndef arch_ptrace_stop_needed
 /**
  * arch_ptrace_stop_needed - Decide whether arch_ptrace_stop() should be called
- * @code:	current->exit_code value ptrace will stop with
- * @info:	siginfo_t pointer (or %NULL) for signal ptrace will stop with
  *
  * This is called with the siglock held, to decide whether or not it's
- * necessary to release the siglock and call arch_ptrace_stop() with the
- * same @code and @info arguments.  It can be defined to a constant if
- * arch_ptrace_stop() is never required, or always is.  On machines where
- * this makes sense, it should be defined to a quick test to optimize out
- * calling arch_ptrace_stop() when it would be superfluous.  For example,
- * if the thread has not been back to user mode since the last stop, the
- * thread state might indicate that nothing needs to be done.
+ * necessary to release the siglock and call arch_ptrace_stop().  It can be
+ * defined to a constant if arch_ptrace_stop() is never required, or always
+ * is.  On machines where this makes sense, it should be defined to a quick
+ * test to optimize out calling arch_ptrace_stop() when it would be
+ * superfluous.  For example, if the thread has not been back to user mode
+ * since the last stop, the thread state might indicate that nothing needs
+ * to be done.
  *
  * This is guaranteed to be invoked once before a task stops for ptrace and
  * may include arch-specific operations necessary prior to a ptrace stop.
  */
-#define arch_ptrace_stop_needed(code, info)	(0)
+#define arch_ptrace_stop_needed()	(0)
 #endif
 
 #ifndef arch_ptrace_stop
 /**
  * arch_ptrace_stop - Do machine-specific work before stopping for ptrace
- * @code:	current->exit_code value ptrace will stop with
- * @info:	siginfo_t pointer (or %NULL) for signal ptrace will stop with
  *
  * This is called with no locks held when arch_ptrace_stop_needed() has
  * just returned nonzero.  It is allowed to block, e.g. for user memory
@@ -394,7 +390,7 @@ static inline void user_single_step_report(struct pt_regs *regs)
  * we only do it when the arch requires it for this particular stop, as
  * indicated by arch_ptrace_stop_needed().
  */
-#define arch_ptrace_stop(code, info)		do { } while (0)
+#define arch_ptrace_stop()		do { } while (0)
 #endif
 
 #ifndef current_pt_regs
diff --git a/kernel/signal.c b/kernel/signal.c
index 9f2dc9cf3208..c9759ff2cb43 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2200,7 +2200,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 {
 	bool gstop_done = false;
 
-	if (arch_ptrace_stop_needed(exit_code, info)) {
+	if (arch_ptrace_stop_needed()) {
 		/*
 		 * The arch code has something special to do before a
 		 * ptrace stop.  This is allowed to block, e.g. for faults
@@ -2210,7 +2210,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		 * any signal bookkeeping like checking group_stop_count.
 		 */
 		spin_unlock_irq(&current->sighand->siglock);
-		arch_ptrace_stop(exit_code, info);
+		arch_ptrace_stop();
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
-- 
2.20.1



* [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
  2021-09-24  0:09 ` [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop Eric W. Biederman
  2021-09-24  0:10 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
@ 2021-09-24  0:10 ` Eric W. Biederman
  2021-09-24 15:38   ` Kees Cook
  2021-09-24  0:11 ` [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm Eric W. Biederman
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Prevent exec from continuing when a fatal signal is pending by
replacing mmap_read_lock with mmap_read_lock_killable.  This is always
the right thing to do as userspace will never observe an exec complete
when there is a fatal signal pending.

With that change it becomes unnecessary to explicitly test for a core
dump in progress.  In coredump_wait, zap_threads arranges under
mmap_write_lock for every task that uses the mm to also have SIGKILL
pending, which means mmap_read_lock_killable will always return -EINTR
when old_mm->core_state is present.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..b6079f1a098e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -987,16 +987,14 @@ static int exec_mmap(struct mm_struct *mm)
 
 	if (old_mm) {
 		/*
-		 * Make sure that if there is a core dump in progress
-		 * for the old mm, we get out and die instead of going
-		 * through with the exec.  We must hold mmap_lock around
-		 * checking core_state and changing tsk->mm.
+		 * If there is a pending fatal signal perhaps a signal
+		 * whose default action is to create a coredump get
+		 * out and die instead of going through with the exec.
 		 */
-		mmap_read_lock(old_mm);
-		if (unlikely(old_mm->core_state)) {
-			mmap_read_unlock(old_mm);
+		ret = mmap_read_lock_killable(old_mm);
+		if (ret) {
 			up_write(&tsk->signal->exec_update_lock);
-			return -EINTR;
+			return ret;
 		}
 	}
 
-- 
2.20.1



* [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
                   ` (2 preceding siblings ...)
  2021-09-24  0:10 ` [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state Eric W. Biederman
@ 2021-09-24  0:11 ` Eric W. Biederman
  2021-09-24 18:28   ` Kees Cook
  2021-09-24  0:11 ` [PATCH 5/6] coredump: Don't perform any cleanups before dumping core Eric W. Biederman
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Separate the coredump logic from the ordinary exit_mm logic
by moving the coredump logic out of exit_mm into its own
function, coredump_exit_mm.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c |  6 ++--
 kernel/exit.c | 76 +++++++++++++++++++++++++++------------------------
 mm/oom_kill.c |  6 ++--
 3 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 3224dee44d30..5e0e08a7fb9b 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -404,8 +404,8 @@ static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
 	 *
 	 * do_exit:
 	 *	The caller holds mm->mmap_lock. This means that the task which
-	 *	uses this mm can't pass exit_mm(), so it can't exit or clear
-	 *	its ->mm.
+	 *	uses this mm can't pass coredump_exit_mm(), so it can't exit or
+	 *	clear its ->mm.
 	 *
 	 * de_thread:
 	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
@@ -500,7 +500,7 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
 		next = curr->next;
 		task = curr->task;
 		/*
-		 * see exit_mm(), curr->task must not see
+		 * see coredump_exit_mm(), curr->task must not see
 		 * ->task == NULL before we read ->next.
 		 */
 		smp_mb();
diff --git a/kernel/exit.c b/kernel/exit.c
index 91a43e57a32e..cb1619d8fd64 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -339,6 +339,46 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 	}
 }
 
+static void coredump_exit_mm(struct mm_struct *mm)
+{
+	struct core_state *core_state;
+
+	/*
+	 * Serialize with any possible pending coredump.
+	 * We must hold mmap_lock around checking core_state
+	 * and clearing tsk->mm.  The core-inducing thread
+	 * will increment ->nr_threads for each thread in the
+	 * group with ->mm != NULL.
+	 */
+	core_state = mm->core_state;
+	if (core_state) {
+		struct core_thread self;
+
+		mmap_read_unlock(mm);
+
+		self.task = current;
+		if (self.task->flags & PF_SIGNALED)
+			self.next = xchg(&core_state->dumper.next, &self);
+		else
+			self.task = NULL;
+		/*
+		 * Implies mb(), the result of xchg() must be visible
+		 * to core_state->dumper.
+		 */
+		if (atomic_dec_and_test(&core_state->nr_threads))
+			complete(&core_state->startup);
+
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!self.task) /* see coredump_finish() */
+				break;
+			freezable_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+		mmap_read_lock(mm);
+	}
+}
+
 #ifdef CONFIG_MEMCG
 /*
  * A task is exiting.   If it owned this mm, find a new owner for the mm.
@@ -434,47 +474,13 @@ void mm_update_next_owner(struct mm_struct *mm)
 static void exit_mm(void)
 {
 	struct mm_struct *mm = current->mm;
-	struct core_state *core_state;
 
 	exit_mm_release(current, mm);
 	if (!mm)
 		return;
 	sync_mm_rss(mm);
-	/*
-	 * Serialize with any possible pending coredump.
-	 * We must hold mmap_lock around checking core_state
-	 * and clearing tsk->mm.  The core-inducing thread
-	 * will increment ->nr_threads for each thread in the
-	 * group with ->mm != NULL.
-	 */
 	mmap_read_lock(mm);
-	core_state = mm->core_state;
-	if (core_state) {
-		struct core_thread self;
-
-		mmap_read_unlock(mm);
-
-		self.task = current;
-		if (self.task->flags & PF_SIGNALED)
-			self.next = xchg(&core_state->dumper.next, &self);
-		else
-			self.task = NULL;
-		/*
-		 * Implies mb(), the result of xchg() must be visible
-		 * to core_state->dumper.
-		 */
-		if (atomic_dec_and_test(&core_state->nr_threads))
-			complete(&core_state->startup);
-
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!self.task) /* see coredump_finish() */
-				break;
-			freezable_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
-		mmap_read_lock(mm);
-	}
+	coredump_exit_mm(mm);
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..295c8bdfd6c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -787,9 +787,9 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 	struct signal_struct *sig = task->signal;
 
 	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
+	 * A coredumping process may sleep for an extended period in
+	 * coredump_exit_mm(), so the oom killer cannot assume that
+	 * the process will promptly exit and release memory.
 	 */
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
 		return false;
-- 
2.20.1



* [PATCH 5/6] coredump:  Don't perform any cleanups before dumping core
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
                   ` (3 preceding siblings ...)
  2021-09-24  0:11 ` [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm Eric W. Biederman
@ 2021-09-24  0:11 ` Eric W. Biederman
  2021-09-24 18:51   ` Kees Cook
  2021-09-24  0:12 ` [PATCH 6/6] coredump: Limit coredumps to a single thread group Eric W. Biederman
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
before PTRACE_EVENT_EXIT, and before any cleanup work for a task
happens.  This ensures that an accurate copy of the process can be
captured in the coredump as no cleanup for the process happens before
the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
will not be visited by any thread until the coredump is complete.

Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
coredump_task_exit can be recognized and ignored in zap_process.

Now that all of the coredumping happens before exit_mm, remove the
code to test for a coredump in progress from mm_release.

Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
The other tests in may_ptrace_stop all concern avoiding stopping
during a coredump.  These tests are no longer necessary as it is now
guaranteed that fatal_signal_pending will be set if the code enters
ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
not to stop if fatal_signal_pending returns true.

Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
ptrace_stop without fatal_signal_pending being true, as signals are
dequeued in get_signal before calling do_exit.  This is no longer
an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
until after the coredump completes.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c         |  8 ++++----
 include/linux/sched.h |  1 +
 kernel/exit.c         | 19 ++++++++++++-------
 kernel/fork.c         |  3 +--
 kernel/signal.c       | 27 +--------------------------
 mm/oom_kill.c         |  2 +-
 6 files changed, 20 insertions(+), 40 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 5e0e08a7fb9b..d576287fb88b 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -359,7 +359,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
 
 	for_each_thread(start, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		if (t != current && t->mm) {
+		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
 			nr++;
@@ -404,8 +404,8 @@ static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
 	 *
 	 * do_exit:
 	 *	The caller holds mm->mmap_lock. This means that the task which
-	 *	uses this mm can't pass coredump_exit_mm(), so it can't exit or
-	 *	clear its ->mm.
+	 *	uses this mm can't pass coredump_task_exit(), so it can't exit
+	 *	or clear its ->mm.
 	 *
 	 * de_thread:
 	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
@@ -500,7 +500,7 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
 		next = curr->next;
 		task = curr->task;
 		/*
-		 * see coredump_exit_mm(), curr->task must not see
+		 * see coredump_task_exit(), curr->task must not see
 		 * ->task == NULL before we read ->next.
 		 */
 		smp_mb();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e12b524426b0..f3741f23935e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1664,6 +1664,7 @@ extern struct pid *cad_pid;
 #define PF_VCPU			0x00000001	/* I'm a virtual CPU */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
+#define PF_POSTCOREDUMP		0x00000008	/* Coredumps should ignore this task */
 #define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
diff --git a/kernel/exit.c b/kernel/exit.c
index cb1619d8fd64..774e6b5061b8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -339,23 +339,29 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 	}
 }
 
-static void coredump_exit_mm(struct mm_struct *mm)
+static void coredump_task_exit(struct task_struct *tsk)
 {
 	struct core_state *core_state;
+	struct mm_struct *mm;
+
+	mm = tsk->mm;
+	if (!mm)
+		return;
 
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold mmap_lock around checking core_state
-	 * and clearing tsk->mm.  The core-inducing thread
+	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
 	 * will increment ->nr_threads for each thread in the
-	 * group with ->mm != NULL.
+	 * group without PF_POSTCOREDUMP set.
 	 */
+	mmap_read_lock(mm);
+	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = mm->core_state;
+	mmap_read_unlock(mm);
 	if (core_state) {
 		struct core_thread self;
 
-		mmap_read_unlock(mm);
-
 		self.task = current;
 		if (self.task->flags & PF_SIGNALED)
 			self.next = xchg(&core_state->dumper.next, &self);
@@ -375,7 +381,6 @@ static void coredump_exit_mm(struct mm_struct *mm)
 			freezable_schedule();
 		}
 		__set_current_state(TASK_RUNNING);
-		mmap_read_lock(mm);
 	}
 }
 
@@ -480,7 +485,6 @@ static void exit_mm(void)
 		return;
 	sync_mm_rss(mm);
 	mmap_read_lock(mm);
-	coredump_exit_mm(mm);
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
@@ -768,6 +772,7 @@ void __noreturn do_exit(long code)
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
 
+	coredump_task_exit(tsk);
 	ptrace_event(PTRACE_EVENT_EXIT, code);
 
 	validate_creds_for_do_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..9bd9f2da9e41 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1392,8 +1392,7 @@ static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 	 * purposes.
 	 */
 	if (tsk->clear_child_tid) {
-		if (!(tsk->signal->flags & SIGNAL_GROUP_COREDUMP) &&
-		    atomic_read(&mm->mm_users) > 1) {
+		if (atomic_read(&mm->mm_users) > 1) {
 			/*
 			 * We don't check the error code - if userspace has
 			 * not set up a proper pointer then tough luck.
diff --git a/kernel/signal.c b/kernel/signal.c
index c9759ff2cb43..b0db80acc6ef 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2158,31 +2158,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
-static inline bool may_ptrace_stop(void)
-{
-	if (!likely(current->ptrace))
-		return false;
-	/*
-	 * Are we in the middle of do_coredump?
-	 * If so and our tracer is also part of the coredump stopping
-	 * is a deadlock situation, and pointless because our tracer
-	 * is dead so don't allow us to stop.
-	 * If SIGKILL was already sent before the caller unlocked
-	 * ->siglock we must see ->core_state != NULL. Otherwise it
-	 * is safe to enter schedule().
-	 *
-	 * This is almost outdated, a task with the pending SIGKILL can't
-	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
-	 * after SIGKILL was already dequeued.
-	 */
-	if (unlikely(current->mm->core_state) &&
-	    unlikely(current->mm == current->parent->mm))
-		return false;
-
-	return true;
-}
-
-
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2263,7 +2238,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 
 	spin_unlock_irq(&current->sighand->siglock);
 	read_lock(&tasklist_lock);
-	if (may_ptrace_stop()) {
+	if (likely(current->ptrace)) {
 		/*
 		 * Notify parents of the stop.
 		 *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 295c8bdfd6c8..7877c755ab37 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -788,7 +788,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 
 	/*
 	 * A coredumping process may sleep for an extended period in
-	 * coredump_exit_mm(), so the oom killer cannot assume that
+	 * coredump_task_exit(), so the oom killer cannot assume that
 	 * the process will promptly exit and release memory.
 	 */
 	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-- 
2.20.1



* [PATCH 6/6] coredump: Limit coredumps to a single thread group
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
                   ` (4 preceding siblings ...)
  2021-09-24  0:11 ` [PATCH 5/6] coredump: Don't perform any cleanups before dumping core Eric W. Biederman
@ 2021-09-24  0:12 ` Eric W. Biederman
  2021-09-24 18:56   ` Kees Cook
  2021-11-19 16:03   ` Kyle Huey
  2021-09-24  5:59 ` [PATCH 0/6] per signal_struct coredumps Kees Cook
  2021-09-24 15:22 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
  7 siblings, 2 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24  0:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook


Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump, not only that process but
every process that shares the mm is killed.

In the case of vfork this looks like a real world problem.  Consider
the following well-defined sequence.

	if (vfork() == 0) {
		execve(...);
		_exit(EXIT_FAILURE);
	}

If a signal that generates a core dump is received after vfork but
before the execve changes the mm, the process that called vfork will
also be killed (as the mm is shared).

Similarly, if the execve fails after the point of no return, the
kernel delivers SIGSEGV, which will kill both the exec'ing process
and, because the mm is shared, the process that called vfork as well.

As far as I can tell this behavior violates people's reasonable
expectations and POSIX, and is unnecessarily fragile when the system
is low on memory.

Solve this by making a userspace visible change to only kill a single
process/thread group.  This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening.  With LinuxThreads long gone I don't
expect anyone to notice this behavior change in practice.

To accomplish this, move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultaneously.

In zap_threads remove the work to kill anything except for the current
thread group.

[1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/binfmt_elf.c              |  4 +-
 fs/binfmt_elf_fdpic.c        |  2 +-
 fs/coredump.c                | 84 ++++--------------------------------
 fs/proc/array.c              |  6 +--
 include/linux/mm_types.h     | 13 ------
 include/linux/sched/signal.h | 13 ++++++
 kernel/exit.c                | 13 ++----
 kernel/fork.c                |  1 -
 8 files changed, 32 insertions(+), 104 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 69d900a8473d..796e5327ee7d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1834,7 +1834,7 @@ static int fill_note_info(struct elfhdr *elf, int phdrs,
 	/*
 	 * Allocate a structure for each thread.
 	 */
-	for (ct = &dump_task->mm->core_state->dumper; ct; ct = ct->next) {
+	for (ct = &dump_task->signal->core_state->dumper; ct; ct = ct->next) {
 		t = kzalloc(offsetof(struct elf_thread_core_info,
 				     notes[info->thread_notes]),
 			    GFP_KERNEL);
@@ -2024,7 +2024,7 @@ static int fill_note_info(struct elfhdr *elf, int phdrs,
 	if (!elf_note_info_init(info))
 		return 0;
 
-	for (ct = current->mm->core_state->dumper.next;
+	for (ct = current->signal->core_state->dumper.next;
 					ct; ct = ct->next) {
 		ets = kzalloc(sizeof(*ets), GFP_KERNEL);
 		if (!ets)
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 6d8fd6030cbb..c6f588dc4a9d 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1494,7 +1494,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, &vma_data_size))
 		goto end_coredump;
 
-	for (ct = current->mm->core_state->dumper.next;
+	for (ct = current->signal->core_state->dumper.next;
 					ct; ct = ct->next) {
 		tmp = elf_dump_thread_status(cprm->siginfo->si_signo,
 					     ct->task, &thread_status_size);
diff --git a/fs/coredump.c b/fs/coredump.c
index d576287fb88b..a6b3c196cdef 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -369,99 +369,34 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
 	return nr;
 }
 
-static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
+static int zap_threads(struct task_struct *tsk,
 			struct core_state *core_state, int exit_code)
 {
-	struct task_struct *g, *p;
-	unsigned long flags;
 	int nr = -EAGAIN;
 
 	spin_lock_irq(&tsk->sighand->siglock);
 	if (!signal_group_exit(tsk->signal)) {
-		mm->core_state = core_state;
+		tsk->signal->core_state = core_state;
 		tsk->signal->group_exit_task = tsk;
 		nr = zap_process(tsk, exit_code, 0);
 		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
+		tsk->flags |= PF_DUMPCORE;
+		atomic_set(&core_state->nr_threads, nr);
 	}
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (unlikely(nr < 0))
-		return nr;
-
-	tsk->flags |= PF_DUMPCORE;
-	if (atomic_read(&mm->mm_users) == nr + 1)
-		goto done;
-	/*
-	 * We should find and kill all tasks which use this mm, and we should
-	 * count them correctly into ->nr_threads. We don't take tasklist
-	 * lock, but this is safe wrt:
-	 *
-	 * fork:
-	 *	None of sub-threads can fork after zap_process(leader). All
-	 *	processes which were created before this point should be
-	 *	visible to zap_threads() because copy_process() adds the new
-	 *	process to the tail of init_task.tasks list, and lock/unlock
-	 *	of ->siglock provides a memory barrier.
-	 *
-	 * do_exit:
-	 *	The caller holds mm->mmap_lock. This means that the task which
-	 *	uses this mm can't pass coredump_task_exit(), so it can't exit
-	 *	or clear its ->mm.
-	 *
-	 * de_thread:
-	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
-	 *	we must see either old or new leader, this does not matter.
-	 *	However, it can change p->sighand, so lock_task_sighand(p)
-	 *	must be used. Since p->mm != NULL and we hold ->mmap_lock
-	 *	it can't fail.
-	 *
-	 *	Note also that "g" can be the old leader with ->mm == NULL
-	 *	and already unhashed and thus removed from ->thread_group.
-	 *	This is OK, __unhash_process()->list_del_rcu() does not
-	 *	clear the ->next pointer, we will find the new leader via
-	 *	next_thread().
-	 */
-	rcu_read_lock();
-	for_each_process(g) {
-		if (g == tsk->group_leader)
-			continue;
-		if (g->flags & PF_KTHREAD)
-			continue;
-
-		for_each_thread(g, p) {
-			if (unlikely(!p->mm))
-				continue;
-			if (unlikely(p->mm == mm)) {
-				lock_task_sighand(p, &flags);
-				nr += zap_process(p, exit_code,
-							SIGNAL_GROUP_EXIT);
-				unlock_task_sighand(p, &flags);
-			}
-			break;
-		}
-	}
-	rcu_read_unlock();
-done:
-	atomic_set(&core_state->nr_threads, nr);
 	return nr;
 }
 
 static int coredump_wait(int exit_code, struct core_state *core_state)
 {
 	struct task_struct *tsk = current;
-	struct mm_struct *mm = tsk->mm;
 	int core_waiters = -EBUSY;
 
 	init_completion(&core_state->startup);
 	core_state->dumper.task = tsk;
 	core_state->dumper.next = NULL;
 
-	if (mmap_write_lock_killable(mm))
-		return -EINTR;
-
-	if (!mm->core_state)
-		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
-	mmap_write_unlock(mm);
-
+	core_waiters = zap_threads(tsk, core_state, exit_code);
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
@@ -483,7 +418,7 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 	return core_waiters;
 }
 
-static void coredump_finish(struct mm_struct *mm, bool core_dumped)
+static void coredump_finish(bool core_dumped)
 {
 	struct core_thread *curr, *next;
 	struct task_struct *task;
@@ -493,9 +428,10 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
 		current->signal->group_exit_code |= 0x80;
 	current->signal->group_exit_task = NULL;
 	current->signal->flags = SIGNAL_GROUP_EXIT;
+	next = current->signal->core_state->dumper.next;
+	current->signal->core_state = NULL;
 	spin_unlock_irq(&current->sighand->siglock);
 
-	next = mm->core_state->dumper.next;
 	while ((curr = next) != NULL) {
 		next = curr->next;
 		task = curr->task;
@@ -507,8 +443,6 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
 		curr->task = NULL;
 		wake_up_process(task);
 	}
-
-	mm->core_state = NULL;
 }
 
 static bool dump_interrupted(void)
@@ -839,7 +773,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 fail_unlock:
 	kfree(argv);
 	kfree(cn.corename);
-	coredump_finish(mm, core_dumped);
+	coredump_finish(core_dumped);
 	revert_creds(old_cred);
 fail_creds:
 	put_cred(cred);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49be8c8ef555..520c51be1e57 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -408,9 +408,9 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
 		   cpumask_pr_args(&task->cpus_mask));
 }
 
-static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
+static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
 {
-	seq_put_decimal_ull(m, "CoreDumping:\t", !!mm->core_state);
+	seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
 	seq_putc(m, '\n');
 }
 
@@ -436,7 +436,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 
 	if (mm) {
 		task_mem(m, mm);
-		task_core_dumping(m, mm);
+		task_core_dumping(m, task);
 		task_thp_status(m, mm);
 		mmput(mm);
 	}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09c711f..1039f6ae922c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -387,17 +387,6 @@ struct vm_area_struct {
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
 
-struct core_thread {
-	struct task_struct *task;
-	struct core_thread *next;
-};
-
-struct core_state {
-	atomic_t nr_threads;
-	struct core_thread dumper;
-	struct completion startup;
-};
-
 struct kioctx_table;
 struct mm_struct {
 	struct {
@@ -518,8 +507,6 @@ struct mm_struct {
 
 		unsigned long flags; /* Must use atomic bitops to access */
 
-		struct core_state *core_state; /* coredumping support */
-
 #ifdef CONFIG_AIO
 		spinlock_t			ioctx_lock;
 		struct kioctx_table __rcu	*ioctx_table;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index e5f4ce622ee6..a8fe2a593a3a 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -72,6 +72,17 @@ struct multiprocess_signals {
 	struct hlist_node node;
 };
 
+struct core_thread {
+	struct task_struct *task;
+	struct core_thread *next;
+};
+
+struct core_state {
+	atomic_t nr_threads;
+	struct core_thread dumper;
+	struct completion startup;
+};
+
 /*
  * NOTE! "signal_struct" does not have its own
  * locking, because a shared signal_struct always
@@ -110,6 +121,8 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	struct core_state *core_state; /* coredumping support */
+
 	/*
 	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
 	 * manager, to re-parent orphan (double-forking) child processes
diff --git a/kernel/exit.c b/kernel/exit.c
index 774e6b5061b8..2b355e926c13 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -342,23 +342,18 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 static void coredump_task_exit(struct task_struct *tsk)
 {
 	struct core_state *core_state;
-	struct mm_struct *mm;
-
-	mm = tsk->mm;
-	if (!mm)
-		return;
 
 	/*
 	 * Serialize with any possible pending coredump.
-	 * We must hold mmap_lock around checking core_state
+	 * We must hold siglock around checking core_state
 	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
 	 * will increment ->nr_threads for each thread in the
 	 * group without PF_POSTCOREDUMP set.
 	 */
-	mmap_read_lock(mm);
+	spin_lock_irq(&tsk->sighand->siglock);
 	tsk->flags |= PF_POSTCOREDUMP;
-	core_state = mm->core_state;
-	mmap_read_unlock(mm);
+	core_state = tsk->signal->core_state;
+	spin_unlock_irq(&tsk->sighand->siglock);
 	if (core_state) {
 		struct core_thread self;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9bd9f2da9e41..c8adb76982f7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1044,7 +1044,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
-	mm->core_state = NULL;
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] per signal_struct coredumps
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
                   ` (5 preceding siblings ...)
  2021-09-24  0:12 ` [PATCH 6/6] coredump: Limit coredumps to a single thread group Eric W. Biederman
@ 2021-09-24  5:59 ` Kees Cook
  2021-09-24 14:00   ` Eric W. Biederman
  2021-09-24 15:22 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
  7 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2021-09-24  5:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:08:09PM -0500, Eric W. Biederman wrote:
> Current coredumps are mixed up with the exit code, the signal handling
> code and with the ptrace code in ways that make them much more
> complicated than necessary and difficult to follow.
> 
> This series of changes starts with ptrace_stop and cleans it up,
> making it easier to follow what is happening in ptrace_stop.
> Then cleans up the exec interactions with coredumps.
> Then cleans up the coredump interactions with exit.
> Then the coredump interactions with the signal handling code are
> cleaned up.
> 
> The first and last changes are bug fixes for minor bugs.

I haven't had a chance to carefully look through this yet, but I like
the sound of it. :)

Do we have any behavioral tests around this? The ptrace tests in seccomp
don't explicitly exercise the exit handling. Are there regression tests
for "rr"? They're usually the first to notice subtle changes in ptrace.

What I couldn't tell from my quick skim: does this further change the
behavior around force_sig_seccomp()? Specifically the "am I single
threaded?" check:

        case SECCOMP_RET_KILL_THREAD:
        case SECCOMP_RET_KILL_PROCESS:
        default:
                seccomp_log(this_syscall, SIGSYS, action, true);
                /* Dump core only if this is the last remaining thread. */
                if (action != SECCOMP_RET_KILL_THREAD ||
                    (atomic_read(&current->signal->live) == 1)) {
                        /* Show the original registers in the dump. */
                        syscall_rollback(current, current_pt_regs());
                        /* Trigger a coredump with SIGSYS */
                        force_sig_seccomp(this_syscall, data, true);
                } else {
                        do_exit(SIGSYS);
                }
                return -1; /* skip the syscall go directly to signal handling */

I *think* the answer is "no", in the sense that coredump_wait() is still
calling zap_threads() which calls zap_process(). Which now seem like
they should have opposite names. :) And therefore inducing a coredump
will
still take out all threads. (i.e. after your series, no changes need to
be made to seccomp for it.)

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] per signal_struct coredumps
  2021-09-24  5:59 ` [PATCH 0/6] per signal_struct coredumps Kees Cook
@ 2021-09-24 14:00   ` Eric W. Biederman
  0 siblings, 0 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24 14:00 UTC (permalink / raw)
  To: Kees Cook; +Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Thu, Sep 23, 2021 at 07:08:09PM -0500, Eric W. Biederman wrote:
>> Current coredumps are mixed up with the exit code, the signal handling
>> code and with the ptrace code in ways that make them much more
>> complicated than necessary and difficult to follow.
>> 
>> This series of changes starts with ptrace_stop and cleans it up,
>> making it easier to follow what is happening in ptrace_stop.
>> Then cleans up the exec interactions with coredumps.
>> Then cleans up the coredump interactions with exit.
>> Then the coredump interactions with the signal handling code are
>> cleaned up.
>> 
>> The first and last changes are bug fixes for minor bugs.
>
> I haven't had a chance to carefully look through this yet, but I like
> the sound of it. :)

Please do; most of the changes are quite small and straightforward,
which is why this is several patches: to make catching bugs and
bisecting easier.

> Do we have any behavioral tests around this? The ptrace tests in seccomp
> don't explicitly exercise the exit handling. Are there regression tests
> for "rr"? They're usually the first to notice subtle changes in
> ptrace.

There are no tests that I am aware of.  I am hoping to get this into
linux-next so more people will test just because.

> What I couldn't tell from my quick skim: does this further change the
> behavior around force_sig_seccomp()? Specifically the "am I single
> threaded?" check:

There are two changes I can think of with this patchset.
- Tasks killed in the coredump signal path are those that share
  signal_struct, and the coredump will only include those tasks.
- No tasks will reach PTRACE_EVENT_EXIT during a coredump, which
  actually makes PTRACE_EVENT_EXIT and coredumps more reliable, as
  there is no concern about them waiting for each other.


>         case SECCOMP_RET_KILL_THREAD:
>         case SECCOMP_RET_KILL_PROCESS:
>         default:
>                 seccomp_log(this_syscall, SIGSYS, action, true);
>                 /* Dump core only if this is the last remaining thread. */
>                 if (action != SECCOMP_RET_KILL_THREAD ||
>                     (atomic_read(&current->signal->live) == 1)) {
>                         /* Show the original registers in the dump. */
>                         syscall_rollback(current, current_pt_regs());
>                         /* Trigger a coredump with SIGSYS */
>                         force_sig_seccomp(this_syscall, data, true);
>                 } else {
>                         do_exit(SIGSYS);
>                 }
>                 return -1; /* skip the syscall go directly to signal handling */
>
> I *think* the answer is "no", in the sense that coredump_wait() is still
> calling zap_threads() which calls zap_process(). Which now seem like
> should have opposite names. :) And therefore inducing a coredump will
> still take out all threads. (i.e. after your series, no changes need to
> be made to seccomp for it.)

Correct.  Seccomp can stay the same.

What changes in practice is that now SECCOMP_RET_KILL_PROCESS only kills
the process and not other processes that share the mm.


I can imagine a future where the seccomp logic to ask if this is the
final thread will move into signal delivery.  That takes some more
exit cleanups.

In fact, once we are convinced the final change in the series is
correct, there are a lot of simplifications to the code that are
possible.

One I am hoping for is to move the killing up into complete_signal and
then modifying force_sig_info_to_task to not need to stomp sa_handler
and SIGNAL_UNKILLABLE, but instead call the relevant parts of
send_signal by hand, like send_sigqueue does.


Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop
  2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
                   ` (6 preceding siblings ...)
  2021-09-24  5:59 ` [PATCH 0/6] per signal_struct coredumps Kees Cook
@ 2021-09-24 15:22 ` Eric W. Biederman
  7 siblings, 0 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24 15:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Oleg Nesterov, Al Viro, linux-api, Kees Cook,
	David Miller, sparclinux, linux-ia64


Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
exit_code and a siginfo structure.  Neither argument is used by any of
the implementations so just remove the unneeded arguments.

The two architectures that implement arch_ptrace_stop are ia64 and
sparc.  Both architectures flush their register stacks before a
ptrace_stop so that all of the register information can be accessed
by debuggers.

As the question of whether a register stack needs to be flushed is
independent of why ptrace is stopping, taking no arguments makes sense.

Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Resent because my little one distracted me as I was sending this last
night and I did not manage to copy the sparc and ia64 folks.

 arch/ia64/include/asm/ptrace.h  |  4 ++--
 arch/sparc/include/asm/ptrace.h |  8 ++++----
 include/linux/ptrace.h          | 22 +++++++++-------------
 kernel/signal.c                 |  4 ++--
 4 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/include/asm/ptrace.h b/arch/ia64/include/asm/ptrace.h
index 08179135905c..f15504f75f10 100644
--- a/arch/ia64/include/asm/ptrace.h
+++ b/arch/ia64/include/asm/ptrace.h
@@ -129,9 +129,9 @@ static inline long regs_return_value(struct pt_regs *regs)
   extern void ia64_decrement_ip (struct pt_regs *pt);
 
   extern void ia64_ptrace_stop(void);
-  #define arch_ptrace_stop(code, info) \
+  #define arch_ptrace_stop() \
 	ia64_ptrace_stop()
-  #define arch_ptrace_stop_needed(code, info) \
+  #define arch_ptrace_stop_needed() \
 	(!test_thread_flag(TIF_RESTORE_RSE))
 
   extern void ptrace_attach_sync_user_rbs (struct task_struct *);
diff --git a/arch/sparc/include/asm/ptrace.h b/arch/sparc/include/asm/ptrace.h
index 71dd82b43cc5..d1419e669027 100644
--- a/arch/sparc/include/asm/ptrace.h
+++ b/arch/sparc/include/asm/ptrace.h
@@ -26,12 +26,12 @@ static inline bool pt_regs_clear_syscall(struct pt_regs *regs)
 	return (regs->tstate &= ~TSTATE_SYSCALL);
 }
 
-#define arch_ptrace_stop_needed(exit_code, info) \
+#define arch_ptrace_stop_needed() \
 ({	flush_user_windows(); \
 	get_thread_wsaved() != 0; \
 })
 
-#define arch_ptrace_stop(exit_code, info) \
+#define arch_ptrace_stop() \
 	synchronize_user_stack()
 
 #define current_pt_regs() \
@@ -129,12 +129,12 @@ static inline bool pt_regs_clear_syscall(struct pt_regs *regs)
 	return (regs->psr &= ~PSR_SYSCALL);
 }
 
-#define arch_ptrace_stop_needed(exit_code, info) \
+#define arch_ptrace_stop_needed() \
 ({	flush_user_windows(); \
 	current_thread_info()->w_saved != 0;	\
 })
 
-#define arch_ptrace_stop(exit_code, info) \
+#define arch_ptrace_stop() \
 	synchronize_user_stack()
 
 #define current_pt_regs() \
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index b5ebf6c01292..8aee2945ff08 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -362,29 +362,25 @@ static inline void user_single_step_report(struct pt_regs *regs)
 #ifndef arch_ptrace_stop_needed
 /**
  * arch_ptrace_stop_needed - Decide whether arch_ptrace_stop() should be called
- * @code:	current->exit_code value ptrace will stop with
- * @info:	siginfo_t pointer (or %NULL) for signal ptrace will stop with
  *
  * This is called with the siglock held, to decide whether or not it's
- * necessary to release the siglock and call arch_ptrace_stop() with the
- * same @code and @info arguments.  It can be defined to a constant if
- * arch_ptrace_stop() is never required, or always is.  On machines where
- * this makes sense, it should be defined to a quick test to optimize out
- * calling arch_ptrace_stop() when it would be superfluous.  For example,
- * if the thread has not been back to user mode since the last stop, the
- * thread state might indicate that nothing needs to be done.
+ * necessary to release the siglock and call arch_ptrace_stop().  It can be
+ * defined to a constant if arch_ptrace_stop() is never required, or always
+ * is.  On machines where this makes sense, it should be defined to a quick
+ * test to optimize out calling arch_ptrace_stop() when it would be
+ * superfluous.  For example, if the thread has not been back to user mode
+ * since the last stop, the thread state might indicate that nothing needs
+ * to be done.
  *
  * This is guaranteed to be invoked once before a task stops for ptrace and
  * may include arch-specific operations necessary prior to a ptrace stop.
  */
-#define arch_ptrace_stop_needed(code, info)	(0)
+#define arch_ptrace_stop_needed()	(0)
 #endif
 
 #ifndef arch_ptrace_stop
 /**
  * arch_ptrace_stop - Do machine-specific work before stopping for ptrace
- * @code:	current->exit_code value ptrace will stop with
- * @info:	siginfo_t pointer (or %NULL) for signal ptrace will stop with
  *
  * This is called with no locks held when arch_ptrace_stop_needed() has
  * just returned nonzero.  It is allowed to block, e.g. for user memory
@@ -394,7 +390,7 @@ static inline void user_single_step_report(struct pt_regs *regs)
  * we only do it when the arch requires it for this particular stop, as
  * indicated by arch_ptrace_stop_needed().
  */
-#define arch_ptrace_stop(code, info)		do { } while (0)
+#define arch_ptrace_stop()		do { } while (0)
 #endif
 
 #ifndef current_pt_regs
diff --git a/kernel/signal.c b/kernel/signal.c
index 9f2dc9cf3208..c9759ff2cb43 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2200,7 +2200,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 {
 	bool gstop_done = false;
 
-	if (arch_ptrace_stop_needed(exit_code, info)) {
+	if (arch_ptrace_stop_needed()) {
 		/*
 		 * The arch code has something special to do before a
 		 * ptrace stop.  This is allowed to block, e.g. for faults
@@ -2210,7 +2210,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		 * any signal bookkeeping like checking group_stop_count.
 		 */
 		spin_unlock_irq(&current->sighand->siglock);
-		arch_ptrace_stop(exit_code, info);
+		arch_ptrace_stop();
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop
  2021-09-24  0:09 ` [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop Eric W. Biederman
@ 2021-09-24 15:22   ` Kees Cook
  2021-09-24 15:48     ` Eric W. Biederman
  0 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2021-09-24 15:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:09:34PM -0500, Eric W. Biederman wrote:
> 
> The existence of sigkill_pending is a little silly as it is
> functionally a duplicate of fatal_signal_pending that is used in
> exactly one place.

sigkill_pending() checks for &tsk->signal->shared_pending.signal but
fatal_signal_pending() doesn't.

> Checking for pending fatal signals and returning early in ptrace_stop
> is actively harmful.  It causes the ptrace_stop called by
> ptrace_signal to return early before setting current->exit_code.
> Later, when ptrace_signal reads the signal number from
> current->exit_code, the value is undefined, making it unpredictable
> what will happen.
> 
> Instead rely on the fact that schedule will not sleep if there is a
> pending signal that can awaken a task.

This reasoning sounds fine, but I can't see where it's happening.
It looks like recalc_sigpending() is supposed to happen at the start
of scheduling? I see it at the end of ptrace_stop(), though, so it looks
like it's reasonable to skip checking shared_pending.

(Does the scheduler deal with shared_pending directly?)

> Removing the explicit sigkill_pending test fixes ptrace_signal
> when ptrace_stop does not stop, because current->exit_code is always
> set to signr.
> 
> Cc: stable@vger.kernel.org
> Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
> Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/signal.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 952741f6d0f9..9f2dc9cf3208 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2182,15 +2182,6 @@ static inline bool may_ptrace_stop(void)
>  	return true;
>  }
>  
> -/*
> - * Return non-zero if there is a SIGKILL that should be waking us up.
> - * Called with the siglock held.
> - */
> -static bool sigkill_pending(struct task_struct *tsk)
> -{
> -	return sigismember(&tsk->pending.signal, SIGKILL) ||
> -	       sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
> -}
>  
>  /*
>   * This must be called with current->sighand->siglock held.
> @@ -2217,17 +2208,16 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
>  		 * calling arch_ptrace_stop, so we must release it now.
>  		 * To preserve proper semantics, we must do this before
>  		 * any signal bookkeeping like checking group_stop_count.
> -		 * Meanwhile, a SIGKILL could come in before we retake the
> -		 * siglock.  That must prevent us from sleeping in TASK_TRACED.
> -		 * So after regaining the lock, we must check for SIGKILL.

Where is the sleep this comment is talking about?

i.e. will recalc_sigpending() have been called before the above sleep
would happen? I assume it's after ptrace_stop() returns... But I want to
make sure the sleep isn't in ptrace_stop() itself somewhere I can't see.
I *do* see freezable_schedule() called, and that dumps us into
__schedule(), and I don't see a recalc before it checks
signal_pending_state().

Does a recalc need to happen in place of the old sigkill_pending()
call?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop
  2021-09-24  0:10 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
@ 2021-09-24 15:26   ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2021-09-24 15:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:10:03PM -0500, Eric W. Biederman wrote:
> 
> Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
> exit_code and a siginfo structure.  Neither argument is used by any of
> the implementations so just remove the unneeded arguments.
> 
> The two architectures that implement arch_ptrace_stop are ia64 and
> sparc.  Both architectures flush their register stacks before a
> ptrace_stop so that all of the register information can be accessed
> by debuggers.
> 
> As the question of whether a register stack needs to be flushed is
> independent of why ptrace is stopping, taking no arguments makes sense.
> 
> Cc: David Miller <davem@davemloft.net>
> Cc: sparclinux@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Yeah, this is a no-op change. No one is using the arguments, as you say.

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state
  2021-09-24  0:10 ` [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state Eric W. Biederman
@ 2021-09-24 15:38   ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2021-09-24 15:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:10:43PM -0500, Eric W. Biederman wrote:
> 
> Prevent exec continuing when a fatal signal is pending by replacing
> mmap_read_lock with mmap_read_lock_killable.  This is always the right
> thing to do as userspace will never observe an exec complete when
> there is a fatal signal pending.
> 
> With that change it becomes unnecessary to explicitly test for a core
> dump in progress.  In coredump_wait zap_threads arranges under
> mmap_write_lock for all tasks that use a mm to also have SIGKILL
> pending, which means mmap_read_lock_killable will always return -EINTR
> when old_mm->core_state is present.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..b6079f1a098e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -987,16 +987,14 @@ static int exec_mmap(struct mm_struct *mm)
>  
>  	if (old_mm) {
>  		/*
> -		 * Make sure that if there is a core dump in progress
> -		 * for the old mm, we get out and die instead of going
> -		 * through with the exec.  We must hold mmap_lock around
> -		 * checking core_state and changing tsk->mm.
> +		 * If there is a pending fatal signal perhaps a signal
> +		 * whose default action is to create a coredump get
> +		 * out and die instead of going through with the exec.
>  		 */
> -		mmap_read_lock(old_mm);
> -		if (unlikely(old_mm->core_state)) {
> -			mmap_read_unlock(old_mm);
> +		ret = mmap_read_lock_killable(old_mm);

This appears to ultimately call into rwsem_down_read_slowpath(), which
checks signal_pending_state() (and returns the EINTR from before),
so this looks right to me.

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> +		if (ret) {
>  			up_write(&tsk->signal->exec_update_lock);
> -			return -EINTR;
> +			return ret;
>  		}
>  	}
>  
> -- 
> 2.20.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop
  2021-09-24 15:22   ` Kees Cook
@ 2021-09-24 15:48     ` Eric W. Biederman
  2021-09-24 19:06       ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24 15:48 UTC (permalink / raw)
  To: Kees Cook; +Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Thu, Sep 23, 2021 at 07:09:34PM -0500, Eric W. Biederman wrote:
>> 
>> The existence of sigkill_pending is a little silly as it is
>> functionally a duplicate of fatal_signal_pending that is used in
>> exactly one place.
>
> sigkill_pending() checks for &tsk->signal->shared_pending.signal but
> fatal_signal_pending() doesn't.

The extra test is unnecessary, as all SIGKILLs visit complete_signal
and immediately run the loop:

			/*
			 * Start a group exit and wake everybody up.
			 * This way we don't have other threads
			 * running and doing things after a slower
			 * thread has the fatal signal pending.
			 */
			signal->flags = SIGNAL_GROUP_EXIT;
			signal->group_exit_code = sig;
			signal->group_stop_count = 0;
			t = p;
			do {
				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
				sigaddset(&t->pending.signal, SIGKILL);
				signal_wake_up(t, 1);
			} while_each_thread(p, t);
			return;

This sets SIGKILL in the task-specific queue, so only the non-shared
queue needs to be tested.  Further, fatal_signal_pending would
be buggy if this were not the case.

>> Checking for pending fatal signals and returning early in ptrace_stop
>> is actively harmful.  It causes the ptrace_stop called by
>> ptrace_signal to return early before setting current->exit_code.
>> Later, when ptrace_signal reads the signal number from
>> current->exit_code, the value is undefined, making it unpredictable
>> what will happen.
>> 
>> Instead rely on the fact that schedule will not sleep if there is a
>> pending signal that can awaken a task.
>
> This reasoning sounds fine, but I can't see where it's happening.
> It looks like recalc_sigpending() is supposed to happen at the start
> of scheduling? I see it at the end of ptrace_stop(), though, so it looks
> like it's reasonable to skip checking shared_pending.
>
> (Does the scheduler deal with shared_pending directly?)

In the call of signal_pending_state() from kernel/sched/core.c:__schedule().

ptrace_stop would actually be badly broken today if that were not the
case, as several places enter into ptrace_event without testing signals
first.

>> Removing the explicit sigkill_pending test fixes ptrace_signal
>> when ptrace_stop does not stop, because current->exit_code is always
>> set to signr.
>> 
>> Cc: stable@vger.kernel.org
>> Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
>> Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/signal.c | 18 ++++--------------
>>  1 file changed, 4 insertions(+), 14 deletions(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 952741f6d0f9..9f2dc9cf3208 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2182,15 +2182,6 @@ static inline bool may_ptrace_stop(void)
>>  	return true;
>>  }
>>  
>> -/*
>> - * Return non-zero if there is a SIGKILL that should be waking us up.
>> - * Called with the siglock held.
>> - */
>> -static bool sigkill_pending(struct task_struct *tsk)
>> -{
>> -	return sigismember(&tsk->pending.signal, SIGKILL) ||
>> -	       sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
>> -}
>>  
>>  /*
>>   * This must be called with current->sighand->siglock held.
>> @@ -2217,17 +2208,16 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
>>  		 * calling arch_ptrace_stop, so we must release it now.
>>  		 * To preserve proper semantics, we must do this before
>>  		 * any signal bookkeeping like checking group_stop_count.
>> -		 * Meanwhile, a SIGKILL could come in before we retake the
>> -		 * siglock.  That must prevent us from sleeping in TASK_TRACED.
>> -		 * So after regaining the lock, we must check for SIGKILL.
>
> Where is the sleep this comment is talking about?
>
> i.e. will recalc_sigpending() have been called before the above sleep
> would happen? I assume it's after ptrace_stop() returns... But I want to
> make sure the sleep isn't in ptrace_stop() itself somewhere I can't see.
> I *do* see freezable_schedule() called, and that dumps us into
> __schedule(), and I don't see a recalc before it checks
> signal_pending_state().
>
> Does a recalc need to happen in place of the old sigkill_pending()
> call?

You read that correctly: freezable_schedule() is where ptrace_stop() sleeps.

The call chain you are looking for looks something like:
send_signal
  complete_signal
     signal_wake_up
       signal_wake_up_state
         set_tsk_thread_flag(t, TIF_SIGPENDING)

That is to say, complete_signal sets TIF_SIGPENDING and
the per-task sigqueue SIGKILL entry.

Calling recalc_sigpending is only needed when a signal is removed from
the queues, not when a signal is added.

Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm
  2021-09-24  0:11 ` [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm Eric W. Biederman
@ 2021-09-24 18:28   ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2021-09-24 18:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:11:08PM -0500, Eric W. Biederman wrote:
> 
> Separate the coredump logic from the ordinary exit_mm logic
> by moving the coredump logic out of exit_mm into its own
> function coredump_exit_mm.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Yup -- mechanical refactoring looks good.

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] coredump:  Don't perform any cleanups before dumping core
  2021-09-24  0:11 ` [PATCH 5/6] coredump: Don't perform any cleanups before dumping core Eric W. Biederman
@ 2021-09-24 18:51   ` Kees Cook
  2021-09-24 21:28     ` Eric W. Biederman
  0 siblings, 1 reply; 23+ messages in thread
From: Kees Cook @ 2021-09-24 18:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:11:39PM -0500, Eric W. Biederman wrote:
> 
> Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
> before PTRACE_EVENT_EXIT, and before any cleanup work for a task
> happens.  This ensures that an accurate copy of the process can be
> captured in the coredump as no cleanup for the process happens before
> the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
> will not be visited by any thread until the coredump is complete.
> 
> Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
> coredump_task_exit can be recognized and ignored in zap_process.
> 
> Now that all of the coredumping happens before exit_mm remove code to
> test for a coredump in progress from mm_release.
> 
> Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
> The other tests in may_ptrace_stop all concern avoiding stopping
> during a coredump.  These tests are no longer necessary as it is now
> guaranteed that fatal_signal_pending will be set if the code enters
> ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
> not to stop if fatal_signal_pending returns true.
> 
> Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
> ptrace_stop without fatal_signal_pending being true, as signals are
> dequeued in get_signal before calling do_exit.  This is no longer
> an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
> until after the coredump completes.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/coredump.c         |  8 ++++----
>  include/linux/sched.h |  1 +
>  kernel/exit.c         | 19 ++++++++++++-------
>  kernel/fork.c         |  3 +--
>  kernel/signal.c       | 27 +--------------------------
>  mm/oom_kill.c         |  2 +-
>  6 files changed, 20 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 5e0e08a7fb9b..d576287fb88b 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -359,7 +359,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
>  
>  	for_each_thread(start, t) {
>  		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> -		if (t != current && t->mm) {
> +		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {

PF_POSTCOREDUMP is "my exit path is done with anything associated with
coredumping, including not having dumped core", yes? i.e. it's a task
lifetime mark, not a "did this actually dump core" mark?

>  			sigaddset(&t->pending.signal, SIGKILL);
>  			signal_wake_up(t, 1);
>  			nr++;
> @@ -404,8 +404,8 @@ static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
>  	 *
>  	 * do_exit:
>  	 *	The caller holds mm->mmap_lock. This means that the task which
> -	 *	uses this mm can't pass coredump_exit_mm(), so it can't exit or
> -	 *	clear its ->mm.
> +	 *	uses this mm can't pass coredump_task_exit(), so it can't exit
> +	 *	or clear its ->mm.
>  	 *
>  	 * de_thread:
>  	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
> @@ -500,7 +500,7 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
>  		next = curr->next;
>  		task = curr->task;
>  		/*
> -		 * see coredump_exit_mm(), curr->task must not see
> +		 * see coredump_task_exit(), curr->task must not see
>  		 * ->task == NULL before we read ->next.
>  		 */
>  		smp_mb();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..f3741f23935e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1664,6 +1664,7 @@ extern struct pid *cad_pid;
>  #define PF_VCPU			0x00000001	/* I'm a virtual CPU */
>  #define PF_IDLE			0x00000002	/* I am an IDLE thread */
>  #define PF_EXITING		0x00000004	/* Getting shut down */
> +#define PF_POSTCOREDUMP		0x00000008	/* Coredumps should ignore this task */

This might need some bikeshedding. It happens that the coredump code
(zap_process(), actually) will ignore it, but I think what's being
described is "this process has reached a point-of-no-return on the exit
path, where coredumping is either done or never happened".

>  #define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
>  #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
>  #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index cb1619d8fd64..774e6b5061b8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -339,23 +339,29 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>  	}
>  }
>  
> -static void coredump_exit_mm(struct mm_struct *mm)
> +static void coredump_task_exit(struct task_struct *tsk)
>  {
>  	struct core_state *core_state;
> +	struct mm_struct *mm;
> +
> +	mm = tsk->mm;
> +	if (!mm)
> +		return;
>  
>  	/*
>  	 * Serialize with any possible pending coredump.
>  	 * We must hold mmap_lock around checking core_state
> -	 * and clearing tsk->mm.  The core-inducing thread
> +	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
>  	 * will increment ->nr_threads for each thread in the
> -	 * group with ->mm != NULL.
> +	 * group without PF_POSTCOREDUMP set.
>  	 */
> +	mmap_read_lock(mm);
> +	tsk->flags |= PF_POSTCOREDUMP;

What are the races possible here?

- 2 threads exiting, but neither have core_state, so no change, looks ok
- 1 thread exiting, 1 thread has dumped. core_state exists, in which
  case this starts a loop to wait for wakeup?
	if dumping thread enters zap_process, gets to sigaddset/signal_wake_up
	then the exiting thread sets PF_POSTCOREDUMP and goes to sleep,
	will it wait forever? (I can't tell if sighand locking prevents
	this ordering problem...)
- 2 threads dumping -- similar race to above? I suspect I'm missing
  something, as I'd expect this case to already be handled.

-Kees

>  	core_state = mm->core_state;
> +	mmap_read_unlock(mm);
>  	if (core_state) {
>  		struct core_thread self;
>  
> -		mmap_read_unlock(mm);
> -
>  		self.task = current;
>  		if (self.task->flags & PF_SIGNALED)
>  			self.next = xchg(&core_state->dumper.next, &self);
> @@ -375,7 +381,6 @@ static void coredump_exit_mm(struct mm_struct *mm)
>  			freezable_schedule();
>  		}
>  		__set_current_state(TASK_RUNNING);
> -		mmap_read_lock(mm);
>  	}
>  }
>  
> @@ -480,7 +485,6 @@ static void exit_mm(void)
>  		return;
>  	sync_mm_rss(mm);
>  	mmap_read_lock(mm);
> -	coredump_exit_mm(mm);
>  	mmgrab(mm);
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
> @@ -768,6 +772,7 @@ void __noreturn do_exit(long code)
>  	profile_task_exit(tsk);
>  	kcov_task_exit(tsk);
>  
> +	coredump_task_exit(tsk);
>  	ptrace_event(PTRACE_EVENT_EXIT, code);
>  
>  	validate_creds_for_do_exit(tsk);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 38681ad44c76..9bd9f2da9e41 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1392,8 +1392,7 @@ static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>  	 * purposes.
>  	 */
>  	if (tsk->clear_child_tid) {
> -		if (!(tsk->signal->flags & SIGNAL_GROUP_COREDUMP) &&
> -		    atomic_read(&mm->mm_users) > 1) {
> +		if (atomic_read(&mm->mm_users) > 1) {
>  			/*
>  			 * We don't check the error code - if userspace has
>  			 * not set up a proper pointer then tough luck.
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c9759ff2cb43..b0db80acc6ef 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2158,31 +2158,6 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> -static inline bool may_ptrace_stop(void)
> -{
> -	if (!likely(current->ptrace))
> -		return false;
> -	/*
> -	 * Are we in the middle of do_coredump?
> -	 * If so and our tracer is also part of the coredump stopping
> -	 * is a deadlock situation, and pointless because our tracer
> -	 * is dead so don't allow us to stop.
> -	 * If SIGKILL was already sent before the caller unlocked
> -	 * ->siglock we must see ->core_state != NULL. Otherwise it
> -	 * is safe to enter schedule().
> -	 *
> -	 * This is almost outdated, a task with the pending SIGKILL can't
> -	 * block in TASK_TRACED. But PTRACE_EVENT_EXIT can be reported
> -	 * after SIGKILL was already dequeued.
> -	 */
> -	if (unlikely(current->mm->core_state) &&
> -	    unlikely(current->mm == current->parent->mm))
> -		return false;
> -
> -	return true;
> -}
> -
> -
>  /*
>   * This must be called with current->sighand->siglock held.
>   *
> @@ -2263,7 +2238,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
>  
>  	spin_unlock_irq(&current->sighand->siglock);
>  	read_lock(&tasklist_lock);
> -	if (may_ptrace_stop()) {
> +	if (likely(current->ptrace)) {
>  		/*
>  		 * Notify parents of the stop.
>  		 *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 295c8bdfd6c8..7877c755ab37 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -788,7 +788,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
>  
>  	/*
>  	 * A coredumping process may sleep for an extended period in
> -	 * coredump_exit_mm(), so the oom killer cannot assume that
> +	 * coredump_task_exit(), so the oom killer cannot assume that
>  	 * the process will promptly exit and release memory.
>  	 */
>  	if (sig->flags & SIGNAL_GROUP_COREDUMP)
> -- 
> 2.20.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] coredump: Limit coredumps to a single thread group
  2021-09-24  0:12 ` [PATCH 6/6] coredump: Limit coredumps to a single thread group Eric W. Biederman
@ 2021-09-24 18:56   ` Kees Cook
  2021-10-06 17:03     ` Eric W. Biederman
  2021-11-19 16:03   ` Kyle Huey
  1 sibling, 1 reply; 23+ messages in thread
From: Kees Cook @ 2021-09-24 18:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Thu, Sep 23, 2021 at 07:12:33PM -0500, Eric W. Biederman wrote:
> 
> Today when a signal is delivered with a handler of SIG_DFL whose
> default behavior is to generate a core dump not only that process but
> every process that shares the mm is killed.
> 
> In the case of vfork this looks like a real world problem.  Consider
> the following well defined sequence.
> 
> 	if (vfork() == 0) {
> 		execve(...);
> 		_exit(EXIT_FAILURE);
> 	}
> 
> If a signal that generates a core dump is received after vfork but
> before the execve changes the mm, the process that called vfork will
> also be killed (as the mm is shared).
> 
> Similarly if the execve fails after the point of no return the kernel
> delivers SIGSEGV which will kill both the exec'ing process and because
> the mm is shared the process that called vfork as well.
> 
> As far as I can tell this behavior is a violation of people's
> reasonable expectations and of POSIX, and is unnecessarily fragile
> when the system is low on memory.
> 
> Solve this by making a userspace visible change to only kill a single
> process/thread group.  This is possible because Jann Horn recently
> modified[1] the coredump code so that the mm can safely be modified
> while the coredump is happening.  With LinuxThreads long gone I don't
> expect anyone to notice this behavior change in practice.
> 
> To accomplish this move the core_state pointer from mm_struct to
> signal_struct, which allows different thread groups to coredump
> simultaneously.
> 
> In zap_threads remove the work to kill anything except for the current
> thread group.
> 
> [1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
> Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

This looks correct to me, but depends on the 5/6 not introducing any
races. So, to that end:

Reviewed-by: Kees Cook <keescook@chromium.org>

If you have some local tools you've been using for testing this series,
can you toss them into tools/testing/selftests/ptrace/ ? I can help
clean them up if you want.

-Kees

> ---
>  fs/binfmt_elf.c              |  4 +-
>  fs/binfmt_elf_fdpic.c        |  2 +-
>  fs/coredump.c                | 84 ++++--------------------------------
>  fs/proc/array.c              |  6 +--
>  include/linux/mm_types.h     | 13 ------
>  include/linux/sched/signal.h | 13 ++++++
>  kernel/exit.c                | 13 ++----
>  kernel/fork.c                |  1 -
>  8 files changed, 32 insertions(+), 104 deletions(-)
> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 69d900a8473d..796e5327ee7d 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1834,7 +1834,7 @@ static int fill_note_info(struct elfhdr *elf, int phdrs,
>  	/*
>  	 * Allocate a structure for each thread.
>  	 */
> -	for (ct = &dump_task->mm->core_state->dumper; ct; ct = ct->next) {
> +	for (ct = &dump_task->signal->core_state->dumper; ct; ct = ct->next) {
>  		t = kzalloc(offsetof(struct elf_thread_core_info,
>  				     notes[info->thread_notes]),
>  			    GFP_KERNEL);
> @@ -2024,7 +2024,7 @@ static int fill_note_info(struct elfhdr *elf, int phdrs,
>  	if (!elf_note_info_init(info))
>  		return 0;
>  
> -	for (ct = current->mm->core_state->dumper.next;
> +	for (ct = current->signal->core_state->dumper.next;
>  					ct; ct = ct->next) {
>  		ets = kzalloc(sizeof(*ets), GFP_KERNEL);
>  		if (!ets)
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index 6d8fd6030cbb..c6f588dc4a9d 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -1494,7 +1494,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
>  	if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, &vma_data_size))
>  		goto end_coredump;
>  
> -	for (ct = current->mm->core_state->dumper.next;
> +	for (ct = current->signal->core_state->dumper.next;
>  					ct; ct = ct->next) {
>  		tmp = elf_dump_thread_status(cprm->siginfo->si_signo,
>  					     ct->task, &thread_status_size);
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d576287fb88b..a6b3c196cdef 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -369,99 +369,34 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
>  	return nr;
>  }
>  
> -static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
> +static int zap_threads(struct task_struct *tsk,
>  			struct core_state *core_state, int exit_code)
>  {
> -	struct task_struct *g, *p;
> -	unsigned long flags;
>  	int nr = -EAGAIN;
>  
>  	spin_lock_irq(&tsk->sighand->siglock);
>  	if (!signal_group_exit(tsk->signal)) {
> -		mm->core_state = core_state;
> +		tsk->signal->core_state = core_state;
>  		tsk->signal->group_exit_task = tsk;
>  		nr = zap_process(tsk, exit_code, 0);
>  		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
> +		tsk->flags |= PF_DUMPCORE;
> +		atomic_set(&core_state->nr_threads, nr);
>  	}
>  	spin_unlock_irq(&tsk->sighand->siglock);
> -	if (unlikely(nr < 0))
> -		return nr;
> -
> -	tsk->flags |= PF_DUMPCORE;
> -	if (atomic_read(&mm->mm_users) == nr + 1)
> -		goto done;
> -	/*
> -	 * We should find and kill all tasks which use this mm, and we should
> -	 * count them correctly into ->nr_threads. We don't take tasklist
> -	 * lock, but this is safe wrt:
> -	 *
> -	 * fork:
> -	 *	None of sub-threads can fork after zap_process(leader). All
> -	 *	processes which were created before this point should be
> -	 *	visible to zap_threads() because copy_process() adds the new
> -	 *	process to the tail of init_task.tasks list, and lock/unlock
> -	 *	of ->siglock provides a memory barrier.
> -	 *
> -	 * do_exit:
> -	 *	The caller holds mm->mmap_lock. This means that the task which
> -	 *	uses this mm can't pass coredump_task_exit(), so it can't exit
> -	 *	or clear its ->mm.
> -	 *
> -	 * de_thread:
> -	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
> -	 *	we must see either old or new leader, this does not matter.
> -	 *	However, it can change p->sighand, so lock_task_sighand(p)
> -	 *	must be used. Since p->mm != NULL and we hold ->mmap_lock
> -	 *	it can't fail.
> -	 *
> -	 *	Note also that "g" can be the old leader with ->mm == NULL
> -	 *	and already unhashed and thus removed from ->thread_group.
> -	 *	This is OK, __unhash_process()->list_del_rcu() does not
> -	 *	clear the ->next pointer, we will find the new leader via
> -	 *	next_thread().
> -	 */
> -	rcu_read_lock();
> -	for_each_process(g) {
> -		if (g == tsk->group_leader)
> -			continue;
> -		if (g->flags & PF_KTHREAD)
> -			continue;
> -
> -		for_each_thread(g, p) {
> -			if (unlikely(!p->mm))
> -				continue;
> -			if (unlikely(p->mm == mm)) {
> -				lock_task_sighand(p, &flags);
> -				nr += zap_process(p, exit_code,
> -							SIGNAL_GROUP_EXIT);
> -				unlock_task_sighand(p, &flags);
> -			}
> -			break;
> -		}
> -	}
> -	rcu_read_unlock();
> -done:
> -	atomic_set(&core_state->nr_threads, nr);
>  	return nr;
>  }
>  
>  static int coredump_wait(int exit_code, struct core_state *core_state)
>  {
>  	struct task_struct *tsk = current;
> -	struct mm_struct *mm = tsk->mm;
>  	int core_waiters = -EBUSY;
>  
>  	init_completion(&core_state->startup);
>  	core_state->dumper.task = tsk;
>  	core_state->dumper.next = NULL;
>  
> -	if (mmap_write_lock_killable(mm))
> -		return -EINTR;
> -
> -	if (!mm->core_state)
> -		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
> -	mmap_write_unlock(mm);
> -
> +	core_waiters = zap_threads(tsk, core_state, exit_code);
>  	if (core_waiters > 0) {
>  		struct core_thread *ptr;
>  
> @@ -483,7 +418,7 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
>  	return core_waiters;
>  }
>  
> -static void coredump_finish(struct mm_struct *mm, bool core_dumped)
> +static void coredump_finish(bool core_dumped)
>  {
>  	struct core_thread *curr, *next;
>  	struct task_struct *task;
> @@ -493,9 +428,10 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
>  		current->signal->group_exit_code |= 0x80;
>  	current->signal->group_exit_task = NULL;
>  	current->signal->flags = SIGNAL_GROUP_EXIT;
> +	next = current->signal->core_state->dumper.next;
> +	current->signal->core_state = NULL;
>  	spin_unlock_irq(&current->sighand->siglock);
>  
> -	next = mm->core_state->dumper.next;
>  	while ((curr = next) != NULL) {
>  		next = curr->next;
>  		task = curr->task;
> @@ -507,8 +443,6 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
>  		curr->task = NULL;
>  		wake_up_process(task);
>  	}
> -
> -	mm->core_state = NULL;
>  }
>  
>  static bool dump_interrupted(void)
> @@ -839,7 +773,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  fail_unlock:
>  	kfree(argv);
>  	kfree(cn.corename);
> -	coredump_finish(mm, core_dumped);
> +	coredump_finish(core_dumped);
>  	revert_creds(old_cred);
>  fail_creds:
>  	put_cred(cred);
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 49be8c8ef555..520c51be1e57 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -408,9 +408,9 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
>  		   cpumask_pr_args(&task->cpus_mask));
>  }
>  
> -static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
> +static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
>  {
> -	seq_put_decimal_ull(m, "CoreDumping:\t", !!mm->core_state);
> +	seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
>  	seq_putc(m, '\n');
>  }
>  
> @@ -436,7 +436,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>  
>  	if (mm) {
>  		task_mem(m, mm);
> -		task_core_dumping(m, mm);
> +		task_core_dumping(m, task);
>  		task_thp_status(m, mm);
>  		mmput(mm);
>  	}
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7f8ee09c711f..1039f6ae922c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -387,17 +387,6 @@ struct vm_area_struct {
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  } __randomize_layout;
>  
> -struct core_thread {
> -	struct task_struct *task;
> -	struct core_thread *next;
> -};
> -
> -struct core_state {
> -	atomic_t nr_threads;
> -	struct core_thread dumper;
> -	struct completion startup;
> -};
> -
>  struct kioctx_table;
>  struct mm_struct {
>  	struct {
> @@ -518,8 +507,6 @@ struct mm_struct {
>  
>  		unsigned long flags; /* Must use atomic bitops to access */
>  
> -		struct core_state *core_state; /* coredumping support */
> -
>  #ifdef CONFIG_AIO
>  		spinlock_t			ioctx_lock;
>  		struct kioctx_table __rcu	*ioctx_table;
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index e5f4ce622ee6..a8fe2a593a3a 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -72,6 +72,17 @@ struct multiprocess_signals {
>  	struct hlist_node node;
>  };
>  
> +struct core_thread {
> +	struct task_struct *task;
> +	struct core_thread *next;
> +};
> +
> +struct core_state {
> +	atomic_t nr_threads;
> +	struct core_thread dumper;
> +	struct completion startup;
> +};
> +
>  /*
>   * NOTE! "signal_struct" does not have its own
>   * locking, because a shared signal_struct always
> @@ -110,6 +121,8 @@ struct signal_struct {
>  	int			group_stop_count;
>  	unsigned int		flags; /* see SIGNAL_* flags below */
>  
> +	struct core_state *core_state; /* coredumping support */
> +
>  	/*
>  	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
>  	 * manager, to re-parent orphan (double-forking) child processes
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 774e6b5061b8..2b355e926c13 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -342,23 +342,18 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>  static void coredump_task_exit(struct task_struct *tsk)
>  {
>  	struct core_state *core_state;
> -	struct mm_struct *mm;
> -
> -	mm = tsk->mm;
> -	if (!mm)
> -		return;
>  
>  	/*
>  	 * Serialize with any possible pending coredump.
> -	 * We must hold mmap_lock around checking core_state
> +	 * We must hold siglock around checking core_state
>  	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
>  	 * will increment ->nr_threads for each thread in the
>  	 * group without PF_POSTCOREDUMP set.
>  	 */
> -	mmap_read_lock(mm);
> +	spin_lock_irq(&tsk->sighand->siglock);
>  	tsk->flags |= PF_POSTCOREDUMP;
> -	core_state = mm->core_state;
> -	mmap_read_unlock(mm);
> +	core_state = tsk->signal->core_state;
> +	spin_unlock_irq(&tsk->sighand->siglock);
>  	if (core_state) {
>  		struct core_thread self;
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9bd9f2da9e41..c8adb76982f7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1044,7 +1044,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	seqcount_init(&mm->write_protect_seq);
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
> -	mm->core_state = NULL;
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
>  	mm->locked_vm = 0;
> -- 
> 2.20.1
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop
  2021-09-24 15:48     ` Eric W. Biederman
@ 2021-09-24 19:06       ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2021-09-24 19:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Fri, Sep 24, 2021 at 10:48:18AM -0500, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
> 
> > On Thu, Sep 23, 2021 at 07:09:34PM -0500, Eric W. Biederman wrote:
> >> 
> >> The existence of sigkill_pending is a little silly as it is
> >> functionally a duplicate of fatal_signal_pending that is used in
> >> exactly one place.
> >
> > sigkill_pending() checks for &tsk->signal->shared_pending.signal but
> > fatal_signal_pending() doesn't.
> 
> The extra test is unnecessary, as all SIGKILLs visit complete_signal
> and immediately run the loop:
> 
> 			/*
> 			 * Start a group exit and wake everybody up.
> 			 * This way we don't have other threads
> 			 * running and doing things after a slower
> 			 * thread has the fatal signal pending.
> 			 */
> 			signal->flags = SIGNAL_GROUP_EXIT;
> 			signal->group_exit_code = sig;
> 			signal->group_stop_count = 0;
> 			t = p;
> 			do {
> 				task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> 				sigaddset(&t->pending.signal, SIGKILL);
> 				signal_wake_up(t, 1);
> 			} while_each_thread(p, t);
> 			return;
> 
> This sets SIGKILL in the task-specific queue, so only the non-shared
> queue needs to be tested.  Further, fatal_signal_pending would
> be buggy if this were not the case.

Okay, so SIGKILL is special from the perspective of shared_pending. Why
was it tested for before? Or rather: how could SIGKILL ever have gotten
set in shared_pending?

Oh, I think I see what you mean about complete_signal() now: that's just
looking at sig, and doesn't care where it got written. i.e. SIGKILL gets
immediately written to pending, even if the prior path through
__send_signal() only wrote it to shared_pending.

> 
> >> Checking for pending fatal signals and returning early in ptrace_stop
> >> is actively harmful.  It causes the ptrace_stop called by
> >> ptrace_signal to return early before setting current->exit_code.
> >> Later, when ptrace_signal reads the signal number from
> >> current->exit_code, the value is undefined, making it unpredictable
> >> what will happen.
> >> 
> >> Instead rely on the fact that schedule will not sleep if there is a
> >> pending signal that can awaken a task.
> >
> > This reasoning sounds fine, but I can't see where it's happening.
> > It looks like recalc_sigpending() is supposed to happen at the start
> > of scheduling? I see it at the end of ptrace_stop(), though, so it looks
> > like it's reasonable to skip checking shared_pending.
> >
> > (Does the scheduler deal with shared_pending directly?)
> 
> In the call of signal_pending_state() from kernel/sched/core.c:__schedule().
> 
> ptrace_stop would actually be badly broken today if that were not the
> case, as several places enter into ptrace_event without testing signals
> first.
> 
> >> Removing the explicit sigkill_pending test fixes ptrace_signal
> >> when ptrace_stop does not stop, because current->exit_code is always
> >> set to signr.
> >> 
> >> Cc: stable@vger.kernel.org
> >> Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
> >> Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  kernel/signal.c | 18 ++++--------------
> >>  1 file changed, 4 insertions(+), 14 deletions(-)
> >> 
> >> diff --git a/kernel/signal.c b/kernel/signal.c
> >> index 952741f6d0f9..9f2dc9cf3208 100644
> >> --- a/kernel/signal.c
> >> +++ b/kernel/signal.c
> >> @@ -2182,15 +2182,6 @@ static inline bool may_ptrace_stop(void)
> >>  	return true;
> >>  }
> >>  
> >> -/*
> >> - * Return non-zero if there is a SIGKILL that should be waking us up.
> >> - * Called with the siglock held.
> >> - */
> >> -static bool sigkill_pending(struct task_struct *tsk)
> >> -{
> >> -	return sigismember(&tsk->pending.signal, SIGKILL) ||
> >> -	       sigismember(&tsk->signal->shared_pending.signal, SIGKILL);
> >> -}
> >>  
> >>  /*
> >>   * This must be called with current->sighand->siglock held.
> >> @@ -2217,17 +2208,16 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
> >>  		 * calling arch_ptrace_stop, so we must release it now.
> >>  		 * To preserve proper semantics, we must do this before
> >>  		 * any signal bookkeeping like checking group_stop_count.
> >> -		 * Meanwhile, a SIGKILL could come in before we retake the
> >> -		 * siglock.  That must prevent us from sleeping in TASK_TRACED.
> >> -		 * So after regaining the lock, we must check for SIGKILL.
> >
> > Where is the sleep this comment is talking about?
> >
> > i.e. will recalc_sigpending() have been called before the above sleep
> > would happen? I assume it's after ptrace_stop() returns... But I want to
> > make sure the sleep isn't in ptrace_stop() itself somewhere I can't see.
> > I *do* see freezable_schedule() called, and that dumps us into
> > __schedule(), and I don't see a recalc before it checks
> > signal_pending_state().
> >
> > Does a recalc need to happen in place of the old sigkill_pending()
> > call?
> 
> You read that correctly: freezable_schedule is where ptrace_stop sleeps.
> 
> The call chain you are looking for looks something like:
> send_signal
>   complete_signal
>      signal_wake_up
>        signal_wake_up_state
>          set_tsk_thread_flag(t, TIF_SIGPENDING)
> 
> That is to say, complete_signal sets TIF_SIGPENDING and
> the per-task sigqueue SIGKILL entry.
> 
> Calling recalc_sigpending is only needed when a signal is removed from
> the queues, not when a signal is added.

Got it; thanks! Yeah, it was mainly I didn't see where SIGKILL got
handled specially, and now I do. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] coredump:  Don't perform any cleanups before dumping core
  2021-09-24 18:51   ` Kees Cook
@ 2021-09-24 21:28     ` Eric W. Biederman
  2021-09-24 21:41       ` Kees Cook
  0 siblings, 1 reply; 23+ messages in thread
From: Eric W. Biederman @ 2021-09-24 21:28 UTC (permalink / raw)
  To: Kees Cook; +Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

Kees Cook <keescook@chromium.org> writes:

> On Thu, Sep 23, 2021 at 07:11:39PM -0500, Eric W. Biederman wrote:
>> 
>> Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
>> before PTRACE_EVENT_EXIT, and before any cleanup work for a task
>> happens.  This ensures that an accurate copy of the process can be
>> captured in the coredump as no cleanup for the process happens before
>> the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
>> will not be visited by any thread until the coredump is complete.
>> 
>> Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
>> coredump_task_exit can be recognized and ignored in zap_process.
>> 
> >> Now that all of the coredumping happens before exit_mm, remove the
> >> code to test for a coredump in progress from mm_release.
>> 
>> Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
>> The other tests in may_ptrace_stop all concern avoiding stopping
>> during a coredump.  These tests are no longer necessary as it is now
>> guaranteed that fatal_signal_pending will be set if the code enters
>> ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
>> not to stop if fatal_signal_pending returns true.
>> 
>> Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
>> ptrace_stop without fatal_signal_pending being true, as signals are
>> dequeued in get_signal before calling do_exit.  This is no longer
>> an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
>> until after the coredump completes.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/coredump.c         |  8 ++++----
>>  include/linux/sched.h |  1 +
>>  kernel/exit.c         | 19 ++++++++++++-------
>>  kernel/fork.c         |  3 +--
>>  kernel/signal.c       | 27 +--------------------------
>>  mm/oom_kill.c         |  2 +-
>>  6 files changed, 20 insertions(+), 40 deletions(-)
>> 
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 5e0e08a7fb9b..d576287fb88b 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -359,7 +359,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
>>  
>>  	for_each_thread(start, t) {
>>  		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>> -		if (t != current && t->mm) {
>> +		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
>
> PF_POSTCOREDUMP is "my exit path is done with anything associated with
> coredumping, including not having dumped core", yes? i.e. it's a task
> lifetime mark, not a "did this actually dump core" mark?

Correct.  I am open to a better name if you have one.

>>  			sigaddset(&t->pending.signal, SIGKILL);
>>  			signal_wake_up(t, 1);
>>  			nr++;
>> @@ -404,8 +404,8 @@ static int zap_threads(struct task_struct *tsk, struct mm_struct *mm,
>>  	 *
>>  	 * do_exit:
>>  	 *	The caller holds mm->mmap_lock. This means that the task which
>> -	 *	uses this mm can't pass coredump_exit_mm(), so it can't exit or
>> -	 *	clear its ->mm.
>> +	 *	uses this mm can't pass coredump_task_exit(), so it can't exit
>> +	 *	or clear its ->mm.
>>  	 *
>>  	 * de_thread:
>>  	 *	It does list_replace_rcu(&leader->tasks, &current->tasks),
>> @@ -500,7 +500,7 @@ static void coredump_finish(struct mm_struct *mm, bool core_dumped)
>>  		next = curr->next;
>>  		task = curr->task;
>>  		/*
>> -		 * see coredump_exit_mm(), curr->task must not see
>> +		 * see coredump_task_exit(), curr->task must not see
>>  		 * ->task == NULL before we read ->next.
>>  		 */
>>  		smp_mb();
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index e12b524426b0..f3741f23935e 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1664,6 +1664,7 @@ extern struct pid *cad_pid;
>>  #define PF_VCPU			0x00000001	/* I'm a virtual CPU */
>>  #define PF_IDLE			0x00000002	/* I am an IDLE thread */
>>  #define PF_EXITING		0x00000004	/* Getting shut down */
>> +#define PF_POSTCOREDUMP		0x00000008	/* Coredumps should ignore this task */
>
> This might need some bikeshedding. It happens that the coredump code
> (zap_process(), actually) will ignore it, but I think what's being
> described is "this process has reached a point-of-no-return on the exit
> path, where coredumping is either done or never happened".

Yes.

I would be happy for any improvements to the naming or the description.
It all seemed reasonable to me (being the author), but from your reaction
I can see that it is confusing, and I can see why.


>>  #define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
>>  #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
>>  #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index cb1619d8fd64..774e6b5061b8 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -339,23 +339,29 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>>  	}
>>  }
>>  
>> -static void coredump_exit_mm(struct mm_struct *mm)
>> +static void coredump_task_exit(struct task_struct *tsk)
>>  {
>>  	struct core_state *core_state;
>> +	struct mm_struct *mm;
>> +
>> +	mm = tsk->mm;
>> +	if (!mm)
>> +		return;
>>  
>>  	/*
>>  	 * Serialize with any possible pending coredump.
>>  	 * We must hold mmap_lock around checking core_state
>> -	 * and clearing tsk->mm.  The core-inducing thread
>> +	 * and setting PF_POSTCOREDUMP.  The core-inducing thread
>>  	 * will increment ->nr_threads for each thread in the
>> -	 * group with ->mm != NULL.
>> +	 * group without PF_POSTCOREDUMP set.
>>  	 */
>> +	mmap_read_lock(mm);
>> +	tsk->flags |= PF_POSTCOREDUMP;
>
> What are the races possible here?
>
> - 2 threads exiting, but neither have core_state, so no change, looks ok
> - 1 thread exiting, 1 thread has dumped. core_state exists, in which
>   case this starts a loop to wait for wakeup?
> 	if dumping thread enters zap_process, gets to sigaddset/signal_wake_up
> 	then the exiting thread sets PF_POSTCOREDUMP and goes to sleep,
> 	will it wait forever? (I can't tell if sighand locking prevents
> 	this ordering problem...)
> - 2 threads dumping -- similar race to above? I suspect I'm missing
>   something, as I'd expect this case to already be handled.


So everything should work from a locking perspective, as I am not
changing the locking; I am simply moving the call from exit_mm earlier.

To explain how the locking works.

Coredumps are not handled in complete_signal, so when
a thread dequeues the signal the other threads are all running.

If two threads dequeue core-dumping signals at the same time, both will
contend for mmap_write_lock; one will get it and set core_state, and the
second will return -EBUSY from coredump_wait, return from do_coredump,
and proceed to do_exit.

There is a similar set of races that zap_threads, called from
coredump_wait, resolves by testing for signal_group_exit inside of the
sighand lock.

The normal case is one thread runs through do_coredump, coredump_wait,
and zap_threads, counts the threads and waits in coredump_wait for
all of the other threads to stop.

The other threads proceed to do_exit and coredump_task_exit.
From there they discover that core_state is set.  Holding
the mmap_read_lock is enough to ensure that either
no coredump is in progress or all of the setup is complete in
coredump_wait.

If core_state is found it is known that there is a waiter waiting in
coredump_wait.  So the tasks add themselves to the list, decrement
the count to indicate they don't need to be waited for any longer,
and the last one wakes up the waiter in coredump_wait.

The threads then sleep in TASK_UNINTERRUPTIBLE until the coredump
completes.

The dumping thread takes the coredump then calls coredump_finish.
In coredump_finish the linked list is torn down, and each element
of the linked list has its ".task" set to NULL.  Then the task is
woken up.

The task waiting in TASK_UNINTERRUPTIBLE wakes up. Sees that .task =
NULL and proceeds with the rest of do_exit.


The only change this patch makes to that entire process is that
setting "task->mm = NULL" is replaced by setting PF_POSTCOREDUMP.

Does that help?

Eric

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] coredump:  Don't perform any cleanups before dumping core
  2021-09-24 21:28     ` Eric W. Biederman
@ 2021-09-24 21:41       ` Kees Cook
  0 siblings, 0 replies; 23+ messages in thread
From: Kees Cook @ 2021-09-24 21:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

On Fri, Sep 24, 2021 at 04:28:58PM -0500, Eric W. Biederman wrote:
> So everything should work from a locking perspective, as I am not
> changing the locking; I am simply moving the call from exit_mm earlier.
> 
> To explain how the locking works.
> 
> Coredumps are not handled in complete_signal, so when
> a thread dequeues the signal the other threads are all running.
> 
> If two threads dequeue core-dumping signals at the same time, both will
> contend for mmap_write_lock; one will get it and set core_state, and the
> second will return -EBUSY from coredump_wait, return from do_coredump,
> and proceed to do_exit.
> 
> There is a similar set of races that zap_threads, called from
> coredump_wait, resolves by testing for signal_group_exit inside of the
> sighand lock.
> 
> The normal case is one thread runs through do_coredump, coredump_wait,
> and zap_threads, counts the threads and waits in coredump_wait for
> all of the other threads to stop.
> 
> The other threads proceed to do_exit and coredump_task_exit.
> From there they discover that core_state is set.  Holding
> the mmap_read_lock is enough to ensure that either
> no coredump is in progress or all of the setup is complete in
> coredump_wait.
> 
> If core_state is found it is known that there is a waiter waiting in
> coredump_wait.  So the tasks add themselves to the list, decrement
> the count to indicate they don't need to be waited for any longer,
> and the last one wakes up the waiter in coredump_wait.
> 
> The threads then sleep in TASK_UNINTERRUPTIBLE until the coredump
> completes.
> 
> The dumping thread takes the coredump then calls coredump_finish.
> In coredump_finish the linked list is torn down, and each element
> of the linked list has its ".task" set to NULL.  Then the task is
> woken up.
> 
> The task waiting in TASK_UNINTERRUPTIBLE wakes up. Sees that .task =
> NULL and proceeds with the rest of do_exit.
> 
> 
> The only change this patch makes to that entire process is that
> setting "task->mm = NULL" is replaced by setting PF_POSTCOREDUMP.
> 
> Does that help?

That does! Thanks very much for the run-down on the flow there. I'm
convinced. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] coredump: Limit coredumps to a single thread group
  2021-09-24 18:56   ` Kees Cook
@ 2021-10-06 17:03     ` Eric W. Biederman
  0 siblings, 0 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-10-06 17:03 UTC (permalink / raw)
  To: Kees Cook; +Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api

[-- Attachment #1: Type: text/plain, Size: 2688 bytes --]

Kees Cook <keescook@chromium.org> writes:

> On Thu, Sep 23, 2021 at 07:12:33PM -0500, Eric W. Biederman wrote:
>> 
>> Today when a signal is delivered with a handler of SIG_DFL whose
>> default behavior is to generate a core dump not only that process but
>> every process that shares the mm is killed.
>> 
>> In the case of vfork this looks like a real world problem.  Consider
>> the following well defined sequence.
>> 
>> 	if (vfork() == 0) {
>> 		execve(...);
>> 		_exit(EXIT_FAILURE);
>> 	}
>> 
>> If a signal that generates a core dump is received after vfork but
>> before the execve changes the mm the process that called vfork will
>> also be killed (as the mm is shared).
>> 
>> Similarly if the execve fails after the point of no return the kernel
>> delivers SIGSEGV which will kill both the exec'ing process and because
>> the mm is shared the process that called vfork as well.
>> 
>> As far as I can tell this behavior is a violation of people's
>> reasonable expectations, POSIX, and is unnecessarily fragile when the
>> system is low on memory.
>> 
>> Solve this by making a userspace visible change to only kill a single
>> process/thread group.  This is possible because Jann Horn recently
>> modified[1] the coredump code so that the mm can safely be modified
>> while the coredump is happening.  With LinuxThreads long gone I don't
> >> expect anyone to notice this behavior change in practice.
>> 
>> To accomplish this move the core_state pointer from mm_struct to
>> signal_struct, which allows different thread groups to coredump
> >> simultaneously.
>> 
>> In zap_threads remove the work to kill anything except for the current
>> thread group.
>> 
>> [1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
>> Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
>> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> This looks correct to me, but depends on the 5/6 not introducing any
> races. So, to that end:
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> If you have some local tools you've been using for testing this series,
> can you toss them into tools/testing/selftests/ptrace/ ? I can help
> clean them up if want.

I just have a program that goes multi-threaded and calls abort, and a
slight variant of it that calls vfork before calling abort.

It is enough to exercise the code and verify I didn't make any typos.

I have attached the code below.  If you can help make it into a proper
test that would be great.  I have just been manually running gdb
and the like to verify the kernel works as expected.

Eric


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: threaded-coredump.c --]
[-- Type: text/x-csrc, Size: 1346 bytes --]

#include <stdio.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

struct params {
	int argc;
	char **argv;
	char **envp;
	pthread_t parent;
	pthread_t sibling[30];
};

static void *dump_thread(void *arg)
{
	struct params *params = arg;
	void *retval;
	int i;

	pthread_join(params->parent, &retval);
	fprintf(stdout, "Waiting for 5s\n");
	sleep(5);
	fprintf(stdout, "Dumping\n");
	abort();
	fprintf(stdout, "Abort Failed: %d %s\n", errno, strerror(errno));
	for (i = 0; i <= 29; i++) {
		pthread_join(params->sibling[i], &retval);
	}
	fprintf(stdout, "All Done!\n");
	_exit(EXIT_FAILURE);
	return NULL;
}

static void *idle_thread(void *arg)
{
	unsigned long i = (unsigned long)arg;
	sleep(10);
	fprintf(stdout, "Done %lu\n", i);
	fflush(stdout);
	return NULL;
}

int main(int argc, char **argv, char **envp)
{
	struct params *params;
	pthread_t pt;
	unsigned long i;

	params = malloc(sizeof(struct params));
	params->argc = argc - 1;
	params->argv = argv = argv + 1;
	params->envp = envp;
	params->parent = pthread_self();

	pthread_create(&pt, NULL, dump_thread, params);
	for (i = 0; i <= 29; i++)
		pthread_create(&params->sibling[i], NULL, idle_thread, (void *)i);
	pthread_exit(NULL);

	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] coredump: Limit coredumps to a single thread group
  2021-09-24  0:12 ` [PATCH 6/6] coredump: Limit coredumps to a single thread group Eric W. Biederman
  2021-09-24 18:56   ` Kees Cook
@ 2021-11-19 16:03   ` Kyle Huey
  2021-11-19 17:38     ` Eric W. Biederman
  1 sibling, 1 reply; 23+ messages in thread
From: Kyle Huey @ 2021-11-19 16:03 UTC (permalink / raw)
  To: Eric W . Biederman
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api,
	Kees Cook, Robert O'Callahan

> Solve this by making a userspace visible change to only kill a single
> process/thread group. [...] With LinuxThreads long gone I don't
> expect anyone to notice this behavior change in practice.

FWIW rr's test suite does have an explicit test that we correctly record and 
replay the old behavior. We don't actually care what that behavior is though,
so we will update our tests.

- Kyle

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] coredump: Limit coredumps to a single thread group
  2021-11-19 16:03   ` Kyle Huey
@ 2021-11-19 17:38     ` Eric W. Biederman
  0 siblings, 0 replies; 23+ messages in thread
From: Eric W. Biederman @ 2021-11-19 17:38 UTC (permalink / raw)
  To: Kyle Huey
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, Al Viro, linux-api,
	Kees Cook, Robert O'Callahan

Kyle Huey <me@kylehuey.com> writes:

>> Solve this by making a userspace visible change to only kill a single
>> process/thread group. [...] With LinuxThreads long gone I don't
>> expect anyone to notice this behavior change in practice.
>
> FWIW rr's test suite does have an explicit test that we correctly record and 
> replay the old behavior. We don't actually care what that behavior is though,
> so we will update our tests.

Thanks.  I keep being afraid someone will have a program where it
actually matters.  But so far so good.

Eric


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-11-19 17:38 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24  0:08 [PATCH 0/6] per signal_struct coredumps Eric W. Biederman
2021-09-24  0:09 ` [PATCH 1/6] signal: Remove the bogus sigkill_pending in ptrace_stop Eric W. Biederman
2021-09-24 15:22   ` Kees Cook
2021-09-24 15:48     ` Eric W. Biederman
2021-09-24 19:06       ` Kees Cook
2021-09-24  0:10 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman
2021-09-24 15:26   ` Kees Cook
2021-09-24  0:10 ` [PATCH 3/6] exec: Check for a pending fatal signal instead of core_state Eric W. Biederman
2021-09-24 15:38   ` Kees Cook
2021-09-24  0:11 ` [PATCH 4/6] exit: Factor coredump_exit_mm out of exit_mm Eric W. Biederman
2021-09-24 18:28   ` Kees Cook
2021-09-24  0:11 ` [PATCH 5/6] coredump: Don't perform any cleanups before dumping core Eric W. Biederman
2021-09-24 18:51   ` Kees Cook
2021-09-24 21:28     ` Eric W. Biederman
2021-09-24 21:41       ` Kees Cook
2021-09-24  0:12 ` [PATCH 6/6] coredump: Limit coredumps to a single thread group Eric W. Biederman
2021-09-24 18:56   ` Kees Cook
2021-10-06 17:03     ` Eric W. Biederman
2021-11-19 16:03   ` Kyle Huey
2021-11-19 17:38     ` Eric W. Biederman
2021-09-24  5:59 ` [PATCH 0/6] per signal_struct coredumps Kees Cook
2021-09-24 14:00   ` Eric W. Biederman
2021-09-24 15:22 ` [PATCH 2/6] ptrace: Remove the unnecessary arguments from arch_ptrace_stop Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).