* [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite
@ 2022-04-12 11:44 Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Hi all,

Reviving the freezer rewrite that was languishing, reinvigorated by the
recent ptrace-vs-PREEMPT_RT trouble.

The first patch adds additional state to the signal/ptrace stop states; this state
is then used to fix the PREEMPT_RT issue (patch #2) and to allow the new freezer
to recover the special stop states (patch #5).

I'm not completely happy with the ptrace solution, but I think it solves all
the various issues I found. I still dislike wait_task_inactive() and now we
have a second copy of that :/

Please consider carefully..

---
 drivers/acpi/x86/s2idle.c         |  12 +-
 drivers/android/binder.c          |   4 +-
 drivers/media/pci/pt3/pt3.c       |   4 +-
 drivers/scsi/scsi_transport_spi.c |   7 +-
 fs/cifs/inode.c                   |   4 +-
 fs/cifs/transport.c               |   5 +-
 fs/coredump.c                     |   5 +-
 fs/nfs/file.c                     |   3 +-
 fs/nfs/inode.c                    |  12 +-
 fs/nfs/nfs3proc.c                 |   3 +-
 fs/nfs/nfs4proc.c                 |  14 +--
 fs/nfs/nfs4state.c                |   3 +-
 fs/nfs/pnfs.c                     |   4 +-
 fs/xfs/xfs_trans_ail.c            |   8 +-
 include/linux/completion.h        |   1 +
 include/linux/freezer.h           | 244 ++------------------------------------
 include/linux/sched.h             |  49 ++++----
 include/linux/sched/jobctl.h      |   8 ++
 include/linux/sched/signal.h      |  15 ++-
 include/linux/sunrpc/sched.h      |   7 +-
 include/linux/suspend.h           |   8 +-
 include/linux/umh.h               |   9 +-
 include/linux/wait.h              |  40 ++++++-
 init/do_mounts_initrd.c           |  10 +-
 kernel/cgroup/legacy_freezer.c    |  23 ++--
 kernel/exit.c                     |   4 +-
 kernel/fork.c                     |   5 +-
 kernel/freezer.c                  | 139 ++++++++++++++++------
 kernel/futex/waitwake.c           |   8 +-
 kernel/hung_task.c                |   4 +-
 kernel/power/hibernate.c          |  35 ++++--
 kernel/power/main.c               |  18 +--
 kernel/power/process.c            |  10 +-
 kernel/power/suspend.c            |  12 +-
 kernel/power/user.c               |  24 ++--
 kernel/ptrace.c                   | 185 +++++++++++++++++++++++------
 kernel/sched/completion.c         |   9 ++
 kernel/sched/core.c               |  24 ++--
 kernel/signal.c                   |  23 ++--
 kernel/time/hrtimer.c             |   4 +-
 kernel/umh.c                      |  18 ++-
 mm/khugepaged.c                   |   4 +-
 net/sunrpc/sched.c                |  12 +-
 net/unix/af_unix.c                |   8 +-
 44 files changed, 541 insertions(+), 507 deletions(-)



* [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
@ 2022-04-12 11:44 ` Peter Zijlstra
  2022-04-13 13:29   ` Oleg Nesterov
  2022-04-12 11:44 ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Currently ptrace_stop() / do_signal_stop() rely on the special states
TASK_TRACED and TASK_STOPPED, respectively, to keep unique state. That is, this
state exists only in task->__state and nowhere else.

There are two spots of bother with this:

 - PREEMPT_RT has task->saved_state which complicates matters,
   meaning task_is_{traced,stopped}() needs to check an additional
   variable.

 - An alternative freezer implementation that itself relies on a
   special TASK state would lose TASK_TRACED/TASK_STOPPED and will
   result in misbehaviour.

As such, add additional state to task->jobctl to track this state
outside of task->__state.

NOTE: this doesn't actually fix anything yet, just adds extra state.
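
For orientation only (not part of the patch, merely mirroring the hunks
below): the special stop state is now recorded in task->jobctl right next
to the existing __state transition, e.g.:

	/* sketch, mirroring the kernel/signal.c hunk below */
	current->jobctl |= JOBCTL_TRACED;	/* new: mirror the stop in jobctl */
	set_special_state(TASK_TRACED);		/* existing: __state transition */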

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h        |    8 +++-----
 include/linux/sched/jobctl.h |    8 ++++++++
 include/linux/sched/signal.h |   15 ++++++++++++++-
 kernel/ptrace.c              |   18 ++++++++++++++----
 kernel/signal.c              |    9 ++++++---
 5 files changed, 45 insertions(+), 13 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -118,11 +118,9 @@ struct task_group;
 
 #define task_is_running(task)		(READ_ONCE((task)->__state) == TASK_RUNNING)
 
-#define task_is_traced(task)		((READ_ONCE(task->__state) & __TASK_TRACED) != 0)
-
-#define task_is_stopped(task)		((READ_ONCE(task->__state) & __TASK_STOPPED) != 0)
-
-#define task_is_stopped_or_traced(task)	((READ_ONCE(task->__state) & (__TASK_STOPPED | __TASK_TRACED)) != 0)
+#define task_is_traced(task)		((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
+#define task_is_stopped(task)		((READ_ONCE(task->jobctl) & JOBCTL_STOPPED) != 0)
+#define task_is_stopped_or_traced(task)	((READ_ONCE(task->jobctl) & (JOBCTL_STOPPED | JOBCTL_TRACED)) != 0)
 
 /*
  * Special states are those that do not use the normal wait-loop pattern. See
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -20,6 +20,10 @@ struct task_struct;
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
 
+#define JOBCTL_STOPPED_BIT	24
+#define JOBCTL_TRACED_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 26
+
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
 #define JOBCTL_STOP_CONSUME	(1UL << JOBCTL_STOP_CONSUME_BIT)
@@ -29,6 +33,10 @@ struct task_struct;
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
 
+#define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
+#define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
+
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
 
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -294,8 +294,10 @@ static inline int kernel_dequeue_signal(
 static inline void kernel_signal_stop(void)
 {
 	spin_lock_irq(&current->sighand->siglock);
-	if (current->jobctl & JOBCTL_STOP_DEQUEUED)
+	if (current->jobctl & JOBCTL_STOP_DEQUEUED) {
+		current->jobctl |= JOBCTL_STOPPED;
 		set_special_state(TASK_STOPPED);
+	}
 	spin_unlock_irq(&current->sighand->siglock);
 
 	schedule();
@@ -437,10 +439,21 @@ extern void signal_wake_up_state(struct
 
 static inline void signal_wake_up(struct task_struct *t, bool resume)
 {
+	lockdep_assert_held(&t->sighand->siglock);
+
+	if (resume && !(t->jobctl & JOBCTL_TRACED_FROZEN))
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+
 	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
 }
+
 static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 {
+	lockdep_assert_held(&t->sighand->siglock);
+
+	if (resume)
+		t->jobctl &= ~JOBCTL_TRACED;
+
 	signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
 }
 
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -185,7 +185,12 @@ static bool looks_like_a_spurious_pid(st
 	return true;
 }
 
-/* Ensure that nothing can wake it up, even SIGKILL */
+/*
+ * Ensure that nothing can wake it up, even SIGKILL
+ *
+ * A task is switched to this state while a ptrace operation is in progress;
+ * such that the ptrace operation is uninterruptible.
+ */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
 	bool ret = false;
@@ -197,6 +202,7 @@ static bool ptrace_freeze_traced(struct
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
+		task->jobctl |= JOBCTL_TRACED_FROZEN;
 		WRITE_ONCE(task->__state, __TASK_TRACED);
 		ret = true;
 	}
@@ -218,9 +224,11 @@ static void ptrace_unfreeze_traced(struc
 	 */
 	spin_lock_irq(&task->sighand->siglock);
 	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task))
+		task->jobctl &= ~JOBCTL_TRACED_FROZEN;
+		if (__fatal_signal_pending(task)) {
+			task->jobctl &= ~JOBCTL_TRACED;
 			wake_up_state(task, __TASK_TRACED);
-		else
+		} else
 			WRITE_ONCE(task->__state, TASK_TRACED);
 	}
 	spin_unlock_irq(&task->sighand->siglock);
@@ -475,8 +483,10 @@ static int ptrace_attach(struct task_str
 	 * in and out of STOPPED are protected by siglock.
 	 */
 	if (task_is_stopped(task) &&
-	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING))
+	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) {
+		task->jobctl &= ~JOBCTL_STOPPED;
 		signal_wake_up_state(task, __TASK_STOPPED);
+	}
 
 	spin_unlock(&task->sighand->siglock);
 
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -884,7 +884,7 @@ static int check_kill_permission(int sig
 static void ptrace_trap_notify(struct task_struct *t)
 {
 	WARN_ON_ONCE(!(t->ptrace & PT_SEIZED));
-	assert_spin_locked(&t->sighand->siglock);
+	lockdep_assert_held(&t->sighand->siglock);
 
 	task_set_jobctl_pending(t, JOBCTL_TRAP_NOTIFY);
 	ptrace_signal_wake_up(t, t->jobctl & JOBCTL_LISTENING);
@@ -930,9 +930,10 @@ static bool prepare_signal(int sig, stru
 		for_each_thread(p, t) {
 			flush_sigqueue_mask(&flush, &t->pending);
 			task_clear_jobctl_pending(t, JOBCTL_STOP_PENDING);
-			if (likely(!(t->ptrace & PT_SEIZED)))
+			if (likely(!(t->ptrace & PT_SEIZED))) {
+				t->jobctl &= ~JOBCTL_STOPPED;
 				wake_up_state(t, __TASK_STOPPED);
-			else
+			} else
 				ptrace_trap_notify(t);
 		}
 
@@ -2219,6 +2220,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
+	current->jobctl |= JOBCTL_TRACED;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2460,6 +2462,7 @@ static bool do_signal_stop(int signr)
 		if (task_participate_group_stop(current))
 			notify = CLD_STOPPED;
 
+		current->jobctl |= JOBCTL_STOPPED;
 		set_special_state(TASK_STOPPED);
 		spin_unlock_irq(&current->sighand->siglock);
 




* [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Peter Zijlstra
@ 2022-04-12 11:44 ` Peter Zijlstra
  2022-04-13 13:24   ` Oleg Nesterov
  2022-04-12 11:44 ` [PATCH 3/5] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Currently ptrace_check_attach() / ptrace_freeze_traced() rely on
specific scheduler behaviour to freeze the task. In particular, they
rely on there not being any blocking behaviour between:

  set_special_state(TASK_TRACED);
  ...
  schedule();

Specifically, they rely on being able to change p->__state between
those two points (to clear/set TASK_WAKEKILL) and on
wait_task_inactive() to ensure the task has hit that schedule().

However, PREEMPT_RT is breaking this by introducing sleeping
spinlocks. The consequence of sleeping spinlocks is that p->__state
can change (also see p->saved_state) and that a task can be inactive
(off the runqueue) and *NOT* be at the schedule() as expected.

That is, both the p->__state and wait_task_inactive() usage are
broken.

In order to avoid having to deal with p->saved_state, flip the
wait_task_inactive() and p->__state poking. That is, first wait for
the victim to be in schedule() and then poke p->__state, which is in a
known state by then.

The only problem with this is that it's possible to race with a
concurrent ptrace_detach()+ptrace_attach() landing back in the same
situation as before. To deal with this, compare the task's nvcsw
count to make sure there is no scheduling between the initial
quiescence and the final task state poking.
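
As an orientation aid only, here is a heavily condensed sketch of the
ptrace_check_attach() rework in the diff below (names as in the patch;
the retry loop and error handling are omitted):

	/*
	 * Step 1: lockless, wait until the child is quiescent: traced,
	 * off the runqueue and actually in __TASK_TRACED; record nvcsw.
	 */
	ret = task_call_func(child, __ptrace_pre_freeze, &nvcsw);

	/*
	 * Step 2: under tasklist_lock + siglock, re-check the same
	 * conditions and that nvcsw did not move, then take out
	 * TASK_WAKEKILL by writing __TASK_TRACED.
	 */
	spin_lock_irq(&child->sighand->siglock);
	ret = task_call_func(child, __ptrace_freeze, &nvcsw);
	spin_unlock_irq(&child->sighand->siglock);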

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/ptrace.c     |  175 +++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/core.c |    5 -
 2 files changed, 141 insertions(+), 39 deletions(-)

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -186,31 +186,13 @@ static bool looks_like_a_spurious_pid(st
 }
 
 /*
- * Ensure that nothing can wake it up, even SIGKILL
+ * Ptrace operation is complete, re-allow TASK_WAKEKILL.
  *
- * A task is switched to this state while a ptrace operation is in progress;
- * such that the ptrace operation is uninterruptible.
+ * Unfreeze is easy, since ptrace_check_attach() guarantees the task is off the
+ * runqueue and __TASK_TRACED. If it's still __TASK_TRACED holding
+ * sighand->siglock serializes against ptrace_signal_wake_up() and we can
+ * simply put TASK_WAKEKILL back (or wake because there's a pending fatal).
  */
-static bool ptrace_freeze_traced(struct task_struct *task)
-{
-	bool ret = false;
-
-	/* Lockless, nobody but us can set this flag */
-	if (task->jobctl & JOBCTL_LISTENING)
-		return ret;
-
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
-	    !__fatal_signal_pending(task)) {
-		task->jobctl |= JOBCTL_TRACED_FROZEN;
-		WRITE_ONCE(task->__state, __TASK_TRACED);
-		ret = true;
-	}
-	spin_unlock_irq(&task->sighand->siglock);
-
-	return ret;
-}
-
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
 	if (READ_ONCE(task->__state) != __TASK_TRACED)
@@ -234,6 +216,94 @@ static void ptrace_unfreeze_traced(struc
 	spin_unlock_irq(&task->sighand->siglock);
 }
 
+/*
+ * In order to start a ptrace operation the victim task must be off the
+ * runqueue in state __TASK_TRACED.
+ */
+static int __ptrace_freeze_cond(struct task_struct *p)
+{
+	if (!task_is_traced(p))
+		return -ESRCH;
+
+	if (task_curr(p))
+		return -EINPROGRESS;
+
+	if (p->on_rq)
+		return -EWOULDBLOCK;
+
+	/*
+	 * Due to PREEMPT_RT it is possible the task is blocked on a spinlock
+	 * in state TASK_RTLOCK_WAIT, if so, gotta wait until that goes away.
+	 */
+	if (!(READ_ONCE(p->__state) & __TASK_TRACED))
+		return -EWOULDBLOCK;
+
+	return 0;
+}
+
+/*
+ * Returns:
+ *  0:		  task is off runqueue in TASK_TRACED
+ *  -ESRCH:	  not traced
+ *  -EINPROGRESS: still running
+ *  -EWOULDBLOCK: not running
+ */
+static int __ptrace_pre_freeze(struct task_struct *p, void *arg)
+{
+	int ret;
+
+	ret = __ptrace_freeze_cond(p);
+	if (ret)
+		return ret;
+
+	*(unsigned long *)arg = p->nvcsw;
+
+	return 0;
+}
+
+/*
+ * Returns:
+ *  0:		  task is off runqueue, now __TASK_TRACED
+ *  -ESRCH:	  not traced, or scheduled since pre_freeze
+ *  -EINPROGRESS: still running
+ *  -EWOULDBLOCK: not running
+ */
+static int __ptrace_freeze(struct task_struct *p, void *arg)
+{
+	int ret;
+
+	ret = __ptrace_freeze_cond(p);
+	if (ret)
+		return ret;
+
+	/*
+	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
+	 * anymore.
+	 */
+	if (*(unsigned long *)arg != p->nvcsw)
+		return -ESRCH;
+
+	if (looks_like_a_spurious_pid(p))
+		return -ESRCH;
+
+	if (__fatal_signal_pending(p))
+		return -ESRCH;
+
+	/*
+	 * we hold:
+	 *
+	 *  - tasklist_lock       (avoids ptrace_detach)
+	 *  - p->sighand->siglock (avoids ptrace_signal_wake_up)
+	 *  - p->pi_lock          (avoids anything scheduler)
+	 *
+	 * task is absolutely frozen, now we can safely take out
+	 * TASK_WAKEKILL.
+	 */
+	p->jobctl |= JOBCTL_TRACED_FROZEN;
+	WRITE_ONCE(p->__state, __TASK_TRACED);
+	return 0;
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -253,7 +323,36 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	ktime_t to = TICK_NSEC;
+	unsigned long nvcsw;
+	int ret;
+
+	if (child == current)
+		return -ESRCH;
+
+	if (!ignore_state) for (;;) {
+		/*
+		 * Ensure this child is a quiescent TASK_TRACED task.
+		 */
+		ret = task_call_func(child, __ptrace_pre_freeze, &nvcsw);
+		switch (ret) {
+		case 0:
+			break;
+		case -ESRCH:
+			return ret;
+		case -EINPROGRESS:
+			while (task_curr(child))
+				cpu_relax();
+			continue;
+		case -EWOULDBLOCK:
+			/*
+			 * XXX horrible, randomly wait 1 tick and try again.
+			 */
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
+			continue;
+		}
+	}
 
 	/*
 	 * We take the read lock around doing both checks to close a
@@ -262,29 +361,35 @@ static int ptrace_check_attach(struct ta
 	 * we are sure that this is our traced child and that can only
 	 * be changed by us so it's not changing right after this.
 	 */
+	ret = -ESRCH;
 	read_lock(&tasklist_lock);
 	if (child->ptrace && child->parent == current) {
 		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+		if (ignore_state) {
+			ret = 0;
+			goto unlock;
+		}
+
+		if (child->jobctl & JOBCTL_LISTENING)
+			goto unlock;
+
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
-	read_unlock(&tasklist_lock);
-
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
+		spin_lock_irq(&child->sighand->siglock);
+		ret = task_call_func(child, __ptrace_freeze, &nvcsw);
+		if (ret) {
 			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
+			 * If *anything* changed since __ptrace_pre_freeze()
+			 * this fails.
 			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
 			ret = -ESRCH;
 		}
+		spin_unlock_irq(&child->sighand->siglock);
 	}
+unlock:
+	read_unlock(&tasklist_lock);
 
 	return ret;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@ static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 *  - we form a control dependency vs deactivate_task() below.
-	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {




* [PATCH 3/5] freezer: Have {,un}lock_system_sleep() save/restore flags
  2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
@ 2022-04-12 11:44 ` Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 4/5] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 5/5] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  4 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Rafael explained that the reason for having both PF_NOFREEZE and
PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from
kthread context that has previously called set_freezable().

In preparation for merging the flags, have {,un}lock_system_sleep() save
and restore current->flags.
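
For illustration, callers change along these lines (a sketch matching
the hunks below, e.g. the drivers/acpi/x86/s2idle.c one):

	unsigned int sleep_flags;

	sleep_flags = lock_system_sleep();	/* saves current->flags, sets PF_FREEZER_SKIP */
	/* ... section protected against system sleep transitions ... */
	unlock_system_sleep(sleep_flags);	/* clears PF_FREEZER_SKIP only if it wasn't already set */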

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/acpi/x86/s2idle.c         |   12 ++++++++----
 drivers/scsi/scsi_transport_spi.c |    7 ++++---
 include/linux/suspend.h           |    8 ++++----
 kernel/power/hibernate.c          |   35 ++++++++++++++++++++++-------------
 kernel/power/main.c               |   16 ++++++++++------
 kernel/power/suspend.c            |   12 ++++++++----
 kernel/power/user.c               |   24 ++++++++++++++----------
 7 files changed, 70 insertions(+), 44 deletions(-)

--- a/drivers/acpi/x86/s2idle.c
+++ b/drivers/acpi/x86/s2idle.c
@@ -538,12 +538,14 @@ void acpi_s2idle_setup(void)
 
 int acpi_register_lps0_dev(struct acpi_s2idle_dev_ops *arg)
 {
+	unsigned int sleep_flags;
+
 	if (!lps0_device_handle || sleep_no_lps0)
 		return -ENODEV;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	list_add(&arg->list_node, &lps0_s2idle_devops_head);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return 0;
 }
@@ -551,12 +553,14 @@ EXPORT_SYMBOL_GPL(acpi_register_lps0_dev
 
 void acpi_unregister_lps0_dev(struct acpi_s2idle_dev_ops *arg)
 {
+	unsigned int sleep_flags;
+
 	if (!lps0_device_handle || sleep_no_lps0)
 		return;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	list_del(&arg->list_node);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(acpi_unregister_lps0_dev);
 
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -998,8 +998,9 @@ void
 spi_dv_device(struct scsi_device *sdev)
 {
 	struct scsi_target *starget = sdev->sdev_target;
-	u8 *buffer;
 	const int len = SPI_MAX_ECHO_BUFFER_SIZE*2;
+	unsigned int sleep_flags;
+	u8 *buffer;
 
 	/*
 	 * Because this function and the power management code both call
@@ -1007,7 +1008,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * while suspend or resume is in progress. Hence the
 	 * lock/unlock_system_sleep() calls.
 	 */
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (scsi_autopm_get_device(sdev))
 		goto unlock_system_sleep;
@@ -1058,7 +1059,7 @@ spi_dv_device(struct scsi_device *sdev)
 	scsi_autopm_put_device(sdev);
 
 unlock_system_sleep:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL(spi_dv_device);
 
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -510,8 +510,8 @@ extern bool pm_save_wakeup_count(unsigne
 extern void pm_wakep_autosleep_enabled(bool set);
 extern void pm_print_active_wakeup_sources(void);
 
-extern void lock_system_sleep(void);
-extern void unlock_system_sleep(void);
+extern unsigned int lock_system_sleep(void);
+extern void unlock_system_sleep(unsigned int);
 
 #else /* !CONFIG_PM_SLEEP */
 
@@ -534,8 +534,8 @@ static inline void pm_system_wakeup(void
 static inline void pm_wakeup_clear(bool reset) {}
 static inline void pm_system_irq_wakeup(unsigned int irq_number) {}
 
-static inline void lock_system_sleep(void) {}
-static inline void unlock_system_sleep(void) {}
+static inline unsigned int lock_system_sleep(void) { return 0; }
+static inline void unlock_system_sleep(unsigned int flags) {}
 
 #endif /* !CONFIG_PM_SLEEP */
 
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -92,20 +92,24 @@ bool hibernation_available(void)
  */
 void hibernation_set_ops(const struct platform_hibernation_ops *ops)
 {
+	unsigned int sleep_flags;
+
 	if (ops && !(ops->begin && ops->end &&  ops->pre_snapshot
 	    && ops->prepare && ops->finish && ops->enter && ops->pre_restore
 	    && ops->restore_cleanup && ops->leave)) {
 		WARN_ON(1);
 		return;
 	}
-	lock_system_sleep();
+
+	sleep_flags = lock_system_sleep();
+
 	hibernation_ops = ops;
 	if (ops)
 		hibernation_mode = HIBERNATION_PLATFORM;
 	else if (hibernation_mode == HIBERNATION_PLATFORM)
 		hibernation_mode = HIBERNATION_SHUTDOWN;
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(hibernation_set_ops);
 
@@ -713,6 +717,7 @@ static int load_image_and_restore(void)
 int hibernate(void)
 {
 	bool snapshot_test = false;
+	unsigned int sleep_flags;
 	int error;
 
 	if (!hibernation_available()) {
@@ -720,7 +725,7 @@ int hibernate(void)
 		return -EPERM;
 	}
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	/* The snapshot device should not be opened while we're running */
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -794,7 +799,7 @@ int hibernate(void)
 	pm_restore_console();
 	hibernate_release();
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 	pr_info("hibernation exit\n");
 
 	return error;
@@ -809,9 +814,10 @@ int hibernate(void)
  */
 int hibernate_quiet_exec(int (*func)(void *data), void *data)
 {
+	unsigned int sleep_flags;
 	int error;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -891,7 +897,7 @@ int hibernate_quiet_exec(int (*func)(voi
 	hibernate_release();
 
 unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error;
 }
@@ -1100,11 +1106,12 @@ static ssize_t disk_show(struct kobject
 static ssize_t disk_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t n)
 {
+	int mode = HIBERNATION_INVALID;
+	unsigned int sleep_flags;
 	int error = 0;
-	int i;
 	int len;
 	char *p;
-	int mode = HIBERNATION_INVALID;
+	int i;
 
 	if (!hibernation_available())
 		return -EPERM;
@@ -1112,7 +1119,7 @@ static ssize_t disk_store(struct kobject
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
 		if (len == strlen(hibernation_modes[i])
 		    && !strncmp(buf, hibernation_modes[i], len)) {
@@ -1142,7 +1149,7 @@ static ssize_t disk_store(struct kobject
 	if (!error)
 		pm_pr_dbg("Hibernation mode set to '%s'\n",
 			       hibernation_modes[mode]);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 	return error ? error : n;
 }
 
@@ -1158,9 +1165,10 @@ static ssize_t resume_show(struct kobjec
 static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr,
 			    const char *buf, size_t n)
 {
-	dev_t res;
+	unsigned int sleep_flags;
 	int len = n;
 	char *name;
+	dev_t res;
 
 	if (len && buf[len-1] == '\n')
 		len--;
@@ -1173,9 +1181,10 @@ static ssize_t resume_store(struct kobje
 	if (!res)
 		return -EINVAL;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	swsusp_resume_device = res;
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
+
 	pm_pr_dbg("Configured hibernation resume from disk to %u\n",
 		  swsusp_resume_device);
 	noresume = 0;
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -21,14 +21,16 @@
 
 #ifdef CONFIG_PM_SLEEP
 
-void lock_system_sleep(void)
+unsigned int lock_system_sleep(void)
 {
+	unsigned int flags = current->flags;
 	current->flags |= PF_FREEZER_SKIP;
 	mutex_lock(&system_transition_mutex);
+	return flags;
 }
 EXPORT_SYMBOL_GPL(lock_system_sleep);
 
-void unlock_system_sleep(void)
+void unlock_system_sleep(unsigned int flags)
 {
 	/*
 	 * Don't use freezer_count() because we don't want the call to
@@ -46,7 +48,8 @@ void unlock_system_sleep(void)
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	current->flags &= ~PF_FREEZER_SKIP;
+	if (!(flags & PF_FREEZER_SKIP))
+		current->flags &= ~PF_FREEZER_SKIP;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
@@ -260,16 +263,17 @@ static ssize_t pm_test_show(struct kobje
 static ssize_t pm_test_store(struct kobject *kobj, struct kobj_attribute *attr,
 				const char *buf, size_t n)
 {
+	unsigned int sleep_flags;
 	const char * const *s;
+	int error = -EINVAL;
 	int level;
 	char *p;
 	int len;
-	int error = -EINVAL;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	level = TEST_FIRST;
 	for (s = &pm_tests[level]; level <= TEST_MAX; s++, level++)
@@ -279,7 +283,7 @@ static ssize_t pm_test_store(struct kobj
 			break;
 		}
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error ? error : n;
 }
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -75,9 +75,11 @@ EXPORT_SYMBOL_GPL(pm_suspend_default_s2i
 
 void s2idle_set_ops(const struct platform_s2idle_ops *ops)
 {
-	lock_system_sleep();
+	unsigned int sleep_flags;
+
+	sleep_flags = lock_system_sleep();
 	s2idle_ops = ops;
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 
 static void s2idle_begin(void)
@@ -200,7 +202,9 @@ __setup("mem_sleep_default=", mem_sleep_
  */
 void suspend_set_ops(const struct platform_suspend_ops *ops)
 {
-	lock_system_sleep();
+	unsigned int sleep_flags;
+
+	sleep_flags = lock_system_sleep();
 
 	suspend_ops = ops;
 
@@ -216,7 +220,7 @@ void suspend_set_ops(const struct platfo
 			mem_sleep_current = PM_SUSPEND_MEM;
 	}
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(suspend_set_ops);
 
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -46,12 +46,13 @@ int is_hibernate_resume_dev(dev_t dev)
 static int snapshot_open(struct inode *inode, struct file *filp)
 {
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	int error;
 
 	if (!hibernation_available())
 		return -EPERM;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -97,7 +98,7 @@ static int snapshot_open(struct inode *i
 	data->dev = 0;
 
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error;
 }
@@ -105,8 +106,9 @@ static int snapshot_open(struct inode *i
 static int snapshot_release(struct inode *inode, struct file *filp)
 {
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	swsusp_free();
 	data = filp->private_data;
@@ -123,7 +125,7 @@ static int snapshot_release(struct inode
 			PM_POST_HIBERNATION : PM_POST_RESTORE);
 	hibernate_release();
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return 0;
 }
@@ -131,11 +133,12 @@ static int snapshot_release(struct inode
 static ssize_t snapshot_read(struct file *filp, char __user *buf,
                              size_t count, loff_t *offp)
 {
+	loff_t pg_offp = *offp & ~PAGE_MASK;
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	ssize_t res;
-	loff_t pg_offp = *offp & ~PAGE_MASK;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	data = filp->private_data;
 	if (!data->ready) {
@@ -156,7 +159,7 @@ static ssize_t snapshot_read(struct file
 		*offp += res;
 
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return res;
 }
@@ -164,11 +167,12 @@ static ssize_t snapshot_read(struct file
 static ssize_t snapshot_write(struct file *filp, const char __user *buf,
                               size_t count, loff_t *offp)
 {
+	loff_t pg_offp = *offp & ~PAGE_MASK;
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	ssize_t res;
-	loff_t pg_offp = *offp & ~PAGE_MASK;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	data = filp->private_data;
 
@@ -190,7 +194,7 @@ static ssize_t snapshot_write(struct fil
 	if (res > 0)
 		*offp += res;
 unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return res;
 }




* [PATCH 4/5] freezer,umh: Clean up freezer/initrd interaction
  2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
                   ` (2 preceding siblings ...)
  2022-04-12 11:44 ` [PATCH 3/5] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
@ 2022-04-12 11:44 ` Peter Zijlstra
  2022-04-12 11:44 ` [PATCH 5/5] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  4 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure
that the UMH, which is going to freeze the system, doesn't wait
indefinitely for its caller.

Rework things by adding UMH_FREEZABLE to indicate the completion is
freezable.
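
For illustration, the init/do_mounts_initrd.c hunk below goes from
manually excluding itself from the freezer to simply tagging the wait:

	/* before: caller fiddles with PF_FREEZER_SKIP around the wait */
	current->flags |= PF_FREEZER_SKIP;
	call_usermodehelper_exec(info, UMH_WAIT_PROC);
	current->flags &= ~PF_FREEZER_SKIP;

	/* after: the wait itself is marked freezable */
	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);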

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/umh.h     |    9 +++++----
 init/do_mounts_initrd.c |   10 +---------
 kernel/umh.c            |    8 ++++++++
 3 files changed, 14 insertions(+), 13 deletions(-)

--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -11,10 +11,11 @@
 struct cred;
 struct file;
 
-#define UMH_NO_WAIT	0	/* don't wait at all */
-#define UMH_WAIT_EXEC	1	/* wait for the exec, but not the process */
-#define UMH_WAIT_PROC	2	/* wait for the process to complete */
-#define UMH_KILLABLE	4	/* wait for EXEC/PROC killable */
+#define UMH_NO_WAIT	0x00	/* don't wait at all */
+#define UMH_WAIT_EXEC	0x01	/* wait for the exec, but not the process */
+#define UMH_WAIT_PROC	0x02	/* wait for the process to complete */
+#define UMH_KILLABLE	0x04	/* wait for EXEC/PROC killable */
+#define UMH_FREEZABLE	0x08	/* wait for EXEC/PROC freezable */
 
 struct subprocess_info {
 	struct work_struct work;
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -79,19 +79,11 @@ static void __init handle_initrd(void)
 	init_mkdir("/old", 0700);
 	init_chdir("/old");
 
-	/*
-	 * In case that a resume from disk is carried out by linuxrc or one of
-	 * its children, we need to tell the freezer not to wait for us.
-	 */
-	current->flags |= PF_FREEZER_SKIP;
-
 	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
 					 GFP_KERNEL, init_linuxrc, NULL, NULL);
 	if (!info)
 		return;
-	call_usermodehelper_exec(info, UMH_WAIT_PROC);
-
-	current->flags &= ~PF_FREEZER_SKIP;
+	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
 
 	/* move initrd to rootfs' /old */
 	init_mount("..", ".", NULL, MS_MOVE, NULL);
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -28,6 +28,7 @@
 #include <linux/async.h>
 #include <linux/uaccess.h>
 #include <linux/initrd.h>
+#include <linux/freezer.h>
 
 #include <trace/events/module.h>
 
@@ -436,6 +437,9 @@ int call_usermodehelper_exec(struct subp
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 
+	if (wait & UMH_FREEZABLE)
+		freezer_do_not_count();
+
 	if (wait & UMH_KILLABLE) {
 		retval = wait_for_completion_killable(&done);
 		if (!retval)
@@ -448,6 +452,10 @@ int call_usermodehelper_exec(struct subp
 	}
 
 	wait_for_completion(&done);
+
+	if (wait & UMH_FREEZABLE)
+		freezer_count();
+
 wait_done:
 	retval = sub_info->retval;
 out:




* [PATCH 5/5] freezer,sched: Rewrite core freezer logic
  2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
                   ` (3 preceding siblings ...)
  2022-04-12 11:44 ` [PATCH 4/5] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
@ 2022-04-12 11:44 ` Peter Zijlstra
  4 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-12 11:44 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Rewrite the core freezer to behave better wrt thawing and be simpler
in general.

By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured that frozen tasks stay frozen until thawed and don't randomly
wake up early, as is currently possible.

As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay).

Specifically, the current scheme works a little like:

	freezer_do_not_count();
	schedule();
	freezer_count();

And either the task is blocked, or it lands in try_to_freeze()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.

However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.

That is, thawing tries to thaw things in an explicit order: kernel
threads and workqueues before bringing SMP back, before userspace,
etc. However, due to the above-mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.

This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
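
For illustration, the recurring conversion pattern in the diff below is
to drop the freezer_do_not_count()/freezer_count() bracket and instead
tag the sleep state itself, e.g.:

	/* old style, removed by this patch */
	freezer_do_not_count();
	schedule();
	freezer_count();

	/* new style: the freezer acts on the TASK_FREEZABLE sleep directly */
	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
	schedule();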

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/android/binder.c       |    4 
 drivers/media/pci/pt3/pt3.c    |    4 
 fs/cifs/inode.c                |    4 
 fs/cifs/transport.c            |    5 
 fs/coredump.c                  |    5 
 fs/nfs/file.c                  |    3 
 fs/nfs/inode.c                 |   12 --
 fs/nfs/nfs3proc.c              |    3 
 fs/nfs/nfs4proc.c              |   14 +-
 fs/nfs/nfs4state.c             |    3 
 fs/nfs/pnfs.c                  |    4 
 fs/xfs/xfs_trans_ail.c         |    8 -
 include/linux/completion.h     |    1 
 include/linux/freezer.h        |  244 +----------------------------------------
 include/linux/sched.h          |   41 +++---
 include/linux/sunrpc/sched.h   |    7 -
 include/linux/wait.h           |   40 +++++-
 kernel/cgroup/legacy_freezer.c |   23 +--
 kernel/exit.c                  |    4 
 kernel/fork.c                  |    5 
 kernel/freezer.c               |  139 +++++++++++++++++------
 kernel/futex/waitwake.c        |    8 -
 kernel/hung_task.c             |    4 
 kernel/power/main.c            |    6 -
 kernel/power/process.c         |   10 -
 kernel/sched/completion.c      |    9 +
 kernel/sched/core.c            |   19 ++-
 kernel/signal.c                |   14 +-
 kernel/time/hrtimer.c          |    4 
 kernel/umh.c                   |   20 +--
 mm/khugepaged.c                |    4 
 net/sunrpc/sched.c             |   12 --
 net/unix/af_unix.c             |    8 -
 33 files changed, 282 insertions(+), 409 deletions(-)

--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -4034,10 +4034,9 @@ static int binder_wait_for_work(struct b
 	struct binder_proc *proc = thread->proc;
 	int ret = 0;
 
-	freezer_do_not_count();
 	binder_inner_proc_lock(proc);
 	for (;;) {
-		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		if (binder_has_work_ilocked(thread, do_proc_work))
 			break;
 		if (do_proc_work)
@@ -4054,7 +4053,6 @@ static int binder_wait_for_work(struct b
 	}
 	finish_wait(&thread->wait, &wait);
 	binder_inner_proc_unlock(proc);
-	freezer_count();
 
 	return ret;
 }
--- a/drivers/media/pci/pt3/pt3.c
+++ b/drivers/media/pci/pt3/pt3.c
@@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
 		pt3_proc_dma(adap);
 
 		delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		freezable_schedule_hrtimeout_range(&delay,
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_hrtimeout_range(&delay,
 					PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
 					HRTIMER_MODE_REL);
 	}
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2286,7 +2286,7 @@ cifs_invalidate_mapping(struct inode *in
 static int
 cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -2304,7 +2304,7 @@ cifs_revalidate_mapping(struct inode *in
 		return 0;
 
 	rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
-				     TASK_KILLABLE);
+				     TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (rc)
 		return rc;
 
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -760,8 +760,9 @@ wait_for_response(struct TCP_Server_Info
 {
 	int error;
 
-	error = wait_event_freezekillable_unsafe(server->response_q,
-				    midQ->mid_state != MID_REQUEST_SUBMITTED);
+	error = wait_event_state(server->response_q,
+				 midQ->mid_state != MID_REQUEST_SUBMITTED,
+				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
 	if (error < 0)
 		return -ERESTARTSYS;
 
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -402,9 +402,8 @@ static int coredump_wait(int exit_code,
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
-		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
-		freezer_count();
+		wait_for_completion_state(&core_state->startup,
+					  TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 		/*
 		 * Wait for all the threads to become inactive, so that
 		 * all the thread context (extended register state, like
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -565,7 +565,8 @@ static vm_fault_t nfs_vm_page_mkwrite(st
 	}
 
 	wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
+			   nfs_wait_bit_killable,
+			   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 
 	lock_page(page);
 	mapping = page_file_mapping(page);
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat
 	return nfs_fileid_to_ino_t(fattr->fileid);
 }
 
-static int nfs_wait_killable(int mode)
+int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
 }
-
-int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
-{
-	return nfs_wait_killable(mode);
-}
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
 /**
@@ -1332,7 +1327,8 @@ int nfs_clear_invalid_mapping(struct add
 	 */
 	for (;;) {
 		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
-					 nfs_wait_bit_killable, TASK_KILLABLE);
+					 nfs_wait_bit_killable,
+					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (ret)
 			goto out;
 		spin_lock(&inode->i_lock);
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt,
 		res = rpc_call_sync(clnt, msg, flags);
 		if (res != -EJUKEBOX)
 			break;
-		freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+		__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+		schedule_timeout(NFS_JUKEBOX_RETRY_TIME);
 		res = -ERESTARTSYS;
 	} while (!fatal_signal_pending(current));
 	return res;
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -408,8 +408,8 @@ static int nfs4_delay_killable(long *tim
 {
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
+	__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!__fatal_signal_pending(current))
 		return 0;
 	return -EINTR;
@@ -419,7 +419,8 @@ static int nfs4_delay_interruptible(long
 {
 	might_sleep();
 
-	freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!signal_pending(current))
 		return 0;
 	return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
@@ -7363,7 +7364,8 @@ nfs4_retry_setlk_simple(struct nfs4_stat
 		status = nfs4_proc_setlk(state, cmd, request);
 		if ((status != -EAGAIN) || IS_SETLK(cmd))
 			break;
-		freezable_schedule_timeout_interruptible(timeout);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_timeout(timeout);
 		timeout *= 2;
 		timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
 		status = -ERESTARTSYS;
@@ -7431,10 +7433,8 @@ nfs4_retry_setlk(struct nfs4_state *stat
 			break;
 
 		status = -ERESTARTSYS;
-		freezer_do_not_count();
-		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
+		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 			   NFS4_LOCK_MAXTIMEOUT);
-		freezer_count();
 	} while (!signalled());
 
 	remove_wait_queue(q, &waiter.wait);
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1314,7 +1314,8 @@ int nfs4_wait_clnt_recover(struct nfs_cl
 
 	refcount_inc(&clp->cl_count);
 	res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
-				 nfs_wait_bit_killable, TASK_KILLABLE);
+				 nfs_wait_bit_killable,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (res)
 		goto out;
 	if (clp->cl_cons_state < 0)
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1907,7 +1907,7 @@ static int pnfs_prepare_to_retry_layoutg
 	pnfs_layoutcommit_inode(lo->plh_inode, false);
 	return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
 				   nfs_wait_bit_killable,
-				   TASK_KILLABLE);
+				   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
 
 static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
@@ -3182,7 +3182,7 @@ pnfs_layoutcommit_inode(struct inode *in
 		status = wait_on_bit_lock_action(&nfsi->flags,
 				NFS_INO_LAYOUTCOMMITTING,
 				nfs_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (status)
 			goto out;
 	}
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -602,9 +602,9 @@ xfsaild(
 
 	while (1) {
 		if (tout && tout <= 20)
-			set_current_state(TASK_KILLABLE);
+			set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
 		else
-			set_current_state(TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 		/*
 		 * Check kthread_should_stop() after we set the task state to
@@ -653,14 +653,14 @@ xfsaild(
 		    ailp->ail_target == ailp->ail_target_prev &&
 		    list_empty(&ailp->ail_buf_list)) {
 			spin_unlock(&ailp->ail_lock);
-			freezable_schedule();
+			schedule();
 			tout = 0;
 			continue;
 		}
 		spin_unlock(&ailp->ail_lock);
 
 		if (tout)
-			freezable_schedule_timeout(msecs_to_jiffies(tout));
+			schedule_timeout(msecs_to_jiffies(tout));
 
 		__set_current_state(TASK_RUNNING);
 
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
 extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
+extern int wait_for_completion_state(struct completion *x, unsigned int state);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
 						   unsigned long timeout);
 extern unsigned long wait_for_completion_io_timeout(struct completion *x,
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs
 /*
  * Check if a process has been frozen
  */
-static inline bool frozen(struct task_struct *p)
-{
-	return p->flags & PF_FROZEN;
-}
+extern bool frozen(struct task_struct *p);
 
 extern bool freezing_slow_path(struct task_struct *p);
 
@@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct ta
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
 extern void thaw_kernel_threads(void);
 
-/*
- * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
- * If try_to_freeze causes a lockdep warning it means the caller may deadlock
- */
-static inline bool try_to_freeze_unsafe(void)
+static inline bool try_to_freeze(void)
 {
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
-	return __refrigerator(false);
-}
-
-static inline bool try_to_freeze(void)
-{
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
-	return try_to_freeze_unsafe();
+	return __refrigerator(false);
 }
 
 extern bool freeze_task(struct task_struct *p);
@@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struc
 }
 #endif /* !CONFIG_CGROUP_FREEZER */
 
-/*
- * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
- * calls wait_for_completion(&vfork) and reset right after it returns from this
- * function.  Next, the parent should call try_to_freeze() to freeze itself
- * appropriately in case the child has exited before the freezing of tasks is
- * complete.  However, we don't want kernel threads to be frozen in unexpected
- * places, so we allow them to block freeze_processes() instead or to set
- * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
- * parent won't really block freeze_processes(), since ____call_usermodehelper()
- * (the child) does a little before exec/exit and it can't be frozen before
- * waking up the parent.
- */
-
-
-/**
- * freezer_do_not_count - tell freezer to ignore %current
- *
- * Tell freezers to ignore the current task when determining whether the
- * target frozen state is reached.  IOW, the current task will be
- * considered frozen enough by freezers.
- *
- * The caller shouldn't do anything which isn't allowed for a frozen task
- * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
- * wrap a scheduling operation and nothing much else.
- */
-static inline void freezer_do_not_count(void)
-{
-	current->flags |= PF_FREEZER_SKIP;
-}
-
-/**
- * freezer_count - tell freezer to stop ignoring %current
- *
- * Undo freezer_do_not_count().  It tells freezers that %current should be
- * considered again and tries to freeze if freezing condition is already in
- * effect.
- */
-static inline void freezer_count(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezer_count_unsafe(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	smp_mb();
-	try_to_freeze_unsafe();
-}
-
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
-/*
- * These functions are intended to be used whenever you want allow a sleeping
- * task to be frozen. Note that neither return any clear indication of
- * whether a freeze event happened while in this function.
- */
-
-/* Like schedule(), but should not block the freezer. */
-static inline void freezable_schedule(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezable_schedule_unsafe(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count_unsafe();
-}
-
-/*
- * Like schedule_timeout(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout_interruptible(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
-{
-	long __retval;
-
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/* Like schedule_timeout_killable(), but should not block the freezer. */
-static inline long freezable_schedule_timeout_killable(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/*
- * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
-		u64 delta, const enum hrtimer_mode mode)
-{
-	int __retval;
-	freezer_do_not_count();
-	__retval = schedule_hrtimeout_range(expires, delta, mode);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Freezer-friendly wrappers around wait_event_interruptible(),
- * wait_event_killable() and wait_event_interruptible_timeout(), originally
- * defined in <linux/wait.h>
- */
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-({									\
-	int __retval;							\
-	freezer_do_not_count();						\
-	__retval = wait_event_killable(wq, (condition));		\
-	freezer_count_unsafe();						\
-	__retval;							\
-})
-
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
@@ -281,35 +83,9 @@ static inline void thaw_kernel_threads(v
 
 static inline bool try_to_freeze(void) { return false; }
 
-static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
-#define freezable_schedule()  schedule()
-
-#define freezable_schedule_unsafe()  schedule()
-
-#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
-
-#define freezable_schedule_timeout_interruptible(timeout)		\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_interruptible_unsafe(timeout)	\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_killable(timeout)			\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_timeout_killable_unsafe(timeout)		\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_hrtimeout_range(expires, delta, mode)	\
-	schedule_hrtimeout_range(expires, delta, mode)
-
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-		wait_event_killable(wq, condition)
-
 #endif /* !CONFIG_FREEZER */
 
 #endif	/* FREEZER_H_INCLUDED */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,25 +80,32 @@ struct task_group;
  */
 
 /* Used in tsk->state: */
-#define TASK_RUNNING			0x0000
-#define TASK_INTERRUPTIBLE		0x0001
-#define TASK_UNINTERRUPTIBLE		0x0002
-#define __TASK_STOPPED			0x0004
-#define __TASK_TRACED			0x0008
+#define TASK_RUNNING			0x000000
+#define TASK_INTERRUPTIBLE		0x000001
+#define TASK_UNINTERRUPTIBLE		0x000002
+#define __TASK_STOPPED			0x000004
+#define __TASK_TRACED			0x000008
 /* Used in tsk->exit_state: */
-#define EXIT_DEAD			0x0010
-#define EXIT_ZOMBIE			0x0020
+#define EXIT_DEAD			0x000010
+#define EXIT_ZOMBIE			0x000020
 #define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
 /* Used in tsk->state again: */
-#define TASK_PARKED			0x0040
-#define TASK_DEAD			0x0080
-#define TASK_WAKEKILL			0x0100
-#define TASK_WAKING			0x0200
-#define TASK_NOLOAD			0x0400
-#define TASK_NEW			0x0800
-/* RT specific auxilliary flag to mark RT lock waiters */
-#define TASK_RTLOCK_WAIT		0x1000
-#define TASK_STATE_MAX			0x2000
+#define TASK_PARKED			0x000040
+#define TASK_DEAD			0x000080
+#define TASK_WAKEKILL			0x000100
+#define TASK_WAKING			0x000200
+#define TASK_NOLOAD			0x000400
+#define TASK_NEW			0x000800
+#define TASK_FREEZABLE			0x001000
+#define __TASK_FREEZABLE_UNSAFE	       (0x002000 * IS_ENABLED(CONFIG_LOCKDEP))
+#define TASK_FROZEN			0x004000
+#define TASK_RTLOCK_WAIT		0x008000
+#define TASK_STATE_MAX			0x010000
+
+/*
+ * DO NOT ADD ANY NEW USERS !
+ */
+#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1698,7 +1705,6 @@ extern struct pid *cad_pid;
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1709,7 +1715,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -252,7 +252,7 @@ int		rpc_malloc(struct rpc_task *);
 void		rpc_free(struct rpc_task *);
 int		rpciod_up(void);
 void		rpciod_down(void);
-int		__rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int		rpc_wait_for_completion_task(struct rpc_task *task);
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 struct net;
 void		rpc_show_tasks(struct net *);
@@ -264,11 +264,6 @@ extern struct workqueue_struct *xprtiod_
 void		rpc_prepare_task(struct rpc_task *task);
 gfp_t		rpc_task_gfp_mask(void);
 
-static inline int rpc_wait_for_completion_task(struct rpc_task *task)
-{
-	return __rpc_wait_for_completion_task(task, NULL);
-}
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
 static inline const char * rpc_qname(const struct rpc_wait_queue *q)
 {
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -361,8 +361,8 @@ do {										\
 } while (0)
 
 #define __wait_event_freezable(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-			    freezable_schedule())
+	___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),	\
+			0, 0, schedule())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
@@ -420,8 +420,8 @@ do {										\
 
 #define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
-		      TASK_INTERRUPTIBLE, 0, timeout,				\
-		      __ret = freezable_schedule_timeout(__ret))
+		      (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,		\
+		      __ret = schedule_timeout(__ret))
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
@@ -641,8 +641,8 @@ do {										\
 
 
 #define __wait_event_freezable_exclusive(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,			\
-			freezable_schedule())
+	___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
+			schedule())
 
 #define wait_event_freezable_exclusive(wq, condition)				\
 ({										\
@@ -931,6 +931,34 @@ extern int do_wait_intr_irq(wait_queue_h
 	__ret;									\
 })
 
+#define __wait_event_state(wq, condition, state)				\
+	___wait_event(wq, condition, state, 0, 0, schedule())
+
+/**
+ * wait_event_state - sleep until a condition gets true
+ * @wq_head: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @state: state to sleep in
+ *
+ * The process is put to sleep (@state) until the @condition evaluates to true
+ * or a signal is received.  The @condition is checked each time the waitqueue
+ * @wq_head is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function will return -ERESTARTSYS if it was interrupted by a
+ * signal and 0 if @condition evaluated to true.
+ */
+#define wait_event_state(wq_head, condition, state)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_state(wq_head, condition, state);		\
+	__ret;									\
+})
+
 #define __wait_event_killable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
 		      TASK_KILLABLE, 0, timeout,				\
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -113,7 +113,7 @@ static int freezer_css_online(struct cgr
 
 	if (parent && (parent->state & CGROUP_FREEZING)) {
 		freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 	}
 
 	mutex_unlock(&freezer_mutex);
@@ -134,7 +134,7 @@ static void freezer_css_offline(struct c
 	mutex_lock(&freezer_mutex);
 
 	if (freezer->state & CGROUP_FREEZING)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 
 	freezer->state = 0;
 
@@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup
 			__thaw_task(task);
 		} else {
 			freeze_task(task);
+
 			/* clear FROZEN and propagate upwards */
 			while (freezer && (freezer->state & CGROUP_FROZEN)) {
 				freezer->state &= ~CGROUP_FROZEN;
@@ -271,16 +272,8 @@ static void update_if_frozen(struct cgro
 	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
-		if (freezing(task)) {
-			/*
-			 * freezer_should_skip() indicates that the task
-			 * should be skipped when determining freezing
-			 * completion.  Consider it frozen in addition to
-			 * the usual frozen condition.
-			 */
-			if (!frozen(task) && !freezer_should_skip(task))
-				goto out_iter_end;
-		}
+		if (freezing(task) && !frozen(task))
+			goto out_iter_end;
 	}
 
 	freezer->state |= CGROUP_FROZEN;
@@ -357,7 +350,7 @@ static void freezer_apply_state(struct f
 
 	if (freeze) {
 		if (!(freezer->state & CGROUP_FREEZING))
-			atomic_inc(&system_freezing_cnt);
+			static_branch_inc(&freezer_active);
 		freezer->state |= state;
 		freeze_cgroup(freezer);
 	} else {
@@ -366,9 +359,9 @@ static void freezer_apply_state(struct f
 		freezer->state &= ~state;
 
 		if (!(freezer->state & CGROUP_FREEZING)) {
-			if (was_freezing)
-				atomic_dec(&system_freezing_cnt);
 			freezer->state &= ~CGROUP_FROZEN;
+			if (was_freezing)
+				static_branch_dec(&freezer_active);
 			unfreeze_cgroup(freezer);
 		}
 	}
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -374,10 +374,10 @@ static void coredump_task_exit(struct ta
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
-			freezable_schedule();
+			schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 	}
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1417,13 +1417,12 @@ static void complete_vfork_done(struct t
 static int wait_for_vfork_done(struct task_struct *child,
 				struct completion *vfork)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE;
 	int killed;
 
-	freezer_do_not_count();
 	cgroup_enter_frozen();
-	killed = wait_for_completion_killable(vfork);
+	killed = wait_for_completion_state(vfork, state);
 	cgroup_leave_frozen(false);
-	freezer_count();
 
 	if (killed) {
 		task_lock(child);
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,10 +13,11 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
-/* indicate whether PM freezing is in effect, protected by
+/*
+ * indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
  */
 bool pm_freezing;
@@ -29,7 +30,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -52,41 +53,40 @@ bool freezing_slow_path(struct task_stru
 }
 EXPORT_SYMBOL(freezing_slow_path);
 
+bool frozen(struct task_struct *p)
+{
+	return READ_ONCE(p->__state) & TASK_FROZEN;
+}
+
 /* Refrigerator is place where frozen processes are stored :-). */
 bool __refrigerator(bool check_kthr_stop)
 {
-	/* Hmm, should we be allowed to suspend when there are realtime
-	   processes around? */
+	unsigned int state = get_current_state();
 	bool was_frozen = false;
-	unsigned int save = get_current_state();
 
 	pr_debug("%s entered refrigerator\n", current->comm);
 
+	WARN_ON_ONCE(state && !(state & TASK_NORMAL));
+
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		bool freeze;
+
+		set_current_state(TASK_FROZEN);
 
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
+		if (!freeze)
 			break;
+
 		was_frozen = true;
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-	/*
-	 * Restore saved task state before returning.  The mb'd version
-	 * needs to be used; otherwise, it might silently break
-	 * synchronization which depends on ordered task state change.
-	 */
-	set_current_state(save);
-
 	return was_frozen;
 }
 EXPORT_SYMBOL(__refrigerator);
@@ -101,6 +101,41 @@ static void fake_signal_wake_up(struct t
 	}
 }
 
+static int __set_task_frozen(struct task_struct *p, void *arg)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if (p->on_rq)
+		return 0;
+
+	if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED)))
+		return 0;
+
+	/*
+	 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they
+	 * can suffer spurious wakeups.
+	 */
+	if (state & TASK_FREEZABLE)
+		WARN_ON_ONCE(!(state & TASK_NORMAL));
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * It's dangerous to freeze with locks held; there be dragons there.
+	 */
+	if (!(state & __TASK_FREEZABLE_UNSAFE))
+		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+#endif
+
+	WRITE_ONCE(p->__state, TASK_FROZEN);
+	return TASK_FROZEN;
+}
+
+static bool __freeze_task(struct task_struct *p)
+{
+	/* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */
+	return task_call_func(p, __set_task_frozen, NULL);
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +151,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -137,19 +160,61 @@ bool freeze_task(struct task_struct *p)
 	if (!(p->flags & PF_KTHREAD))
 		fake_signal_wake_up(p);
 	else
-		wake_up_state(p, TASK_INTERRUPTIBLE);
+		wake_up_state(p, TASK_NORMAL);
 
 	spin_unlock_irqrestore(&freezer_lock, flags);
 	return true;
 }
 
+/*
+ * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical
+ * state in p->jobctl. If either of them got a wakeup that was missed because
+ * TASK_FROZEN, then their canonical state reflects that and the below will
+ * refuse to restore the special state and instead issue the wakeup.
+ */
+static int __set_task_special(struct task_struct *p, void *arg)
+{
+	unsigned int state = 0;
+
+	if (p->jobctl & JOBCTL_TRACED) {
+		state = __TASK_TRACED;
+
+		if (!(p->jobctl & JOBCTL_TRACED_FROZEN)) {
+			state |= TASK_WAKEKILL;
+			if (__fatal_signal_pending(p))
+				state = 0;
+		}
+
+	} else if ((p->jobctl & JOBCTL_STOPPED) &&
+		   !__fatal_signal_pending(p)) {
+
+		state = TASK_STOPPED;
+	}
+
+	if (state)
+		WRITE_ONCE(p->__state, state);
+
+	return state;
+}
+
 void __thaw_task(struct task_struct *p)
 {
-	unsigned long flags;
+	unsigned long flags, flags2;
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
-		wake_up_process(p);
+	if (WARN_ON_ONCE(freezing(p)))
+		goto unlock;
+
+	if (lock_task_sighand(p, &flags2)) {
+		/* TASK_FROZEN -> TASK_{STOPPED,TRACED} */
+		bool ret = task_call_func(p, __set_task_special, NULL);
+		unlock_task_sighand(p, &flags2);
+		if (ret)
+			goto unlock;
+	}
+
+	wake_up_state(p, TASK_FROZEN);
+unlock:
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -334,7 +334,7 @@ void futex_wait_queue(struct futex_hash_
 	 * futex_queue() calls spin_unlock() upon completion, both serializing
 	 * access to the hash list and forcing another memory barrier.
 	 */
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	futex_queue(q, hb);
 
 	/* Arm the timer */
@@ -352,7 +352,7 @@ void futex_wait_queue(struct futex_hash_
 		 * is no timeout, or if it has yet to expire.
 		 */
 		if (!timeout || timeout->task)
-			freezable_schedule();
+			schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 }
@@ -430,7 +430,7 @@ static int futex_wait_multiple_setup(str
 			return ret;
 	}
 
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 	for (i = 0; i < count; i++) {
 		u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
@@ -504,7 +504,7 @@ static void futex_sleep_multiple(struct
 			return;
 	}
 
-	freezable_schedule();
+	schedule();
 }
 
 /**
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -95,8 +95,8 @@ static void check_hung_task(struct task_
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(READ_ONCE(t->__state) & (TASK_FREEZABLE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -24,7 +24,7 @@
 unsigned int lock_system_sleep(void)
 {
 	unsigned int flags = current->flags;
-	current->flags |= PF_FREEZER_SKIP;
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 	return flags;
 }
@@ -48,8 +48,8 @@ void unlock_system_sleep(unsigned int fl
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	if (!(flags & PF_FREEZER_SKIP))
-		current->flags &= ~PF_FREEZER_SKIP;
+	if (!(flags & PF_NOFREEZE))
+		current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -132,7 +130,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(0);
 	pr_info("Freezing user space processes ... ");
@@ -193,7 +191,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
 }
 EXPORT_SYMBOL(wait_for_completion_killable);
 
+int __sched wait_for_completion_state(struct completion *x, unsigned int state)
+{
+	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
+	if (t == -ERESTARTSYS)
+		return t;
+	return 0;
+}
+EXPORT_SYMBOL(wait_for_completion_state);
+
 /**
  * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
  * @x:  holds the state of this particular completion
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3260,6 +3260,19 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static inline bool __wti_match(struct task_struct *p, unsigned int match_state)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if ((match_state & TASK_FREEZABLE) && state == TASK_FROZEN)
+		return true;
+
+	if (state == (match_state & ~TASK_FREEZABLE))
+		return true;
+
+	return false;
+}
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3304,7 +3317,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_running(rq, p)) {
-			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
+			if (match_state && !__wti_match(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3319,7 +3332,7 @@ unsigned long wait_task_inactive(struct
 		running = task_running(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || READ_ONCE(p->__state) == match_state)
+		if (!match_state || __wti_match(p, match_state))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
 
@@ -6320,7 +6333,7 @@ static void __sched notrace __schedule(u
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2292,7 +2292,7 @@ static int ptrace_stop(int exit_code, in
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
 		preempt_enable_no_resched();
-		freezable_schedule();
+		schedule();
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2483,7 +2483,7 @@ static bool do_signal_stop(int signr)
 
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
-		freezable_schedule();
+		schedule();
 		return true;
 	} else {
 		/*
@@ -2558,11 +2558,11 @@ static void do_freezer_trap(void)
 	 * immediately (if there is a non-fatal signal pending), and
 	 * put the task into sleep.
 	 */
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	clear_thread_flag(TIF_SIGPENDING);
 	spin_unlock_irq(&current->sighand->siglock);
 	cgroup_enter_frozen();
-	freezable_schedule();
+	schedule();
 }
 
 static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
@@ -3608,9 +3608,9 @@ static int do_sigtimedwait(const sigset_
 		recalc_sigpending();
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		__set_current_state(TASK_INTERRUPTIBLE);
-		ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
-							 HRTIMER_MODE_REL);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
+					       HRTIMER_MODE_REL);
 		spin_lock_irq(&tsk->sighand->siglock);
 		__set_task_blocked(tsk, &tsk->real_blocked);
 		sigemptyset(&tsk->real_blocked);
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2037,11 +2037,11 @@ static int __sched do_nanosleep(struct h
 	struct restart_block *restart;
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		hrtimer_sleeper_start_expires(t, mode);
 
 		if (likely(t->task))
-			freezable_schedule();
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -404,6 +404,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup)
  */
 int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE;
 	DECLARE_COMPLETION_ONSTACK(done);
 	int retval = 0;
 
@@ -437,25 +438,22 @@ int call_usermodehelper_exec(struct subp
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 
+	if (wait & UMH_KILLABLE)
+		state |= TASK_KILLABLE;
+
 	if (wait & UMH_FREEZABLE)
-		freezer_do_not_count();
+		state |= TASK_FREEZABLE;
 
-	if (wait & UMH_KILLABLE) {
-		retval = wait_for_completion_killable(&done);
-		if (!retval)
-			goto wait_done;
+	retval = wait_for_completion_state(&done, state);
+	if (!retval)
+		goto wait_done;
 
+	if (wait & UMH_KILLABLE) {
 		/* umh_complete() will see NULL and free sub_info */
 		if (xchg(&sub_info->complete, NULL))
 			goto unlock;
-		/* fallthrough, umh_complete() was already called */
 	}
 
-	wait_for_completion(&done);
-
-	if (wait & UMH_FREEZABLE)
-		freezer_count();
-
 wait_done:
 	retval = sub_info->retval;
 out:
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -780,8 +780,8 @@ static void khugepaged_alloc_sleep(void)
 	DEFINE_WAIT(wait);
 
 	add_wait_queue(&khugepaged_wait, &wait);
-	freezable_schedule_timeout_interruptible(
-		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -268,7 +268,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -332,14 +332,12 @@ static int rpc_complete_task(struct rpc_
  * to enforce taking of the wq->lock and hence avoid races with
  * rpc_complete_task().
  */
-int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
+int rpc_wait_for_completion_task(struct rpc_task *task)
 {
-	if (action == NULL)
-		action = rpc_wait_bit_killable;
 	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
-			action, TASK_KILLABLE);
+			rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
-EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
+EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
 
 /*
  * Make an RPC task runnable.
@@ -963,7 +961,7 @@ static void __rpc_execute(struct rpc_tas
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2530,13 +2530,14 @@ static long unix_stream_data_wait(struct
 				  struct sk_buff *last, unsigned int last_len,
 				  bool freezable)
 {
+	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, state);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2549,10 +2550,7 @@ static long unix_stream_data_wait(struct
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		if (freezable)
-			timeo = freezable_schedule_timeout(timeo);
-		else
-			timeo = schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-12 11:44 ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
@ 2022-04-13 13:24   ` Oleg Nesterov
  2022-04-13 16:58     ` Peter Zijlstra
  2022-04-13 18:57     ` Oleg Nesterov
  0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-13 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

Hi Peter,

I like 1-2 but I need to read them (and other patches) again, a
couple of nits right now.

On 04/12, Peter Zijlstra wrote:
>
> +static int __ptrace_freeze_cond(struct task_struct *p)
> +{
> +	if (!task_is_traced(p))
> +		return -ESRCH;

	if (!task_is_traced(p) || p->parent != current)
		return -ESRCH;

we should not spin/sleep if it is traced by another task

> +static int __ptrace_freeze(struct task_struct *p, void *arg)
> +{
> +	int ret;
> +
> +	ret = __ptrace_freeze_cond(p);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
> +	 * anymore.
> +	 */
> +	if (*(unsigned long *)arg != p->nvcsw)
> +		return -ESRCH;
> +
> +	if (looks_like_a_spurious_pid(p))
> +		return -ESRCH;

Oh, I do not think __ptrace_freeze() should check for spurious pid...
looks_like_a_spurious_pid() should be called once in ptrace_check_attach()
before task_call_func(__ptrace_freeze).

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  2022-04-12 11:44 ` [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Peter Zijlstra
@ 2022-04-13 13:29   ` Oleg Nesterov
  2022-04-13 16:47     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-13 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/12, Peter Zijlstra wrote:
>
> @@ -475,8 +483,10 @@ static int ptrace_attach(struct task_str
>  	 * in and out of STOPPED are protected by siglock.
>  	 */
>  	if (task_is_stopped(task) &&
> -	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING))
> +	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) {
> +		task->jobctl &= ~JOBCTL_STOPPED;
>  		signal_wake_up_state(task, __TASK_STOPPED);

OK, but just for record before I forget...

It seems that we can s/JOBCTL_STOPPED/JOBCTL_TRACED/ instead, and kill the
nasty wait_on_bit(JOBCTL_TRAPPING_BIT) along with JOBCTL_TRAPPING_BIT. Sure,
this doesn't belong to this series.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
  2022-04-13 13:29   ` Oleg Nesterov
@ 2022-04-13 16:47     ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-13 16:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Wed, Apr 13, 2022 at 03:29:22PM +0200, Oleg Nesterov wrote:
> On 04/12, Peter Zijlstra wrote:
> >
> > @@ -475,8 +483,10 @@ static int ptrace_attach(struct task_str
> >  	 * in and out of STOPPED are protected by siglock.
> >  	 */
> >  	if (task_is_stopped(task) &&
> > -	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING))
> > +	    task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) {
> > +		task->jobctl &= ~JOBCTL_STOPPED;
> >  		signal_wake_up_state(task, __TASK_STOPPED);
> 
> OK, but just for record before I forget...
> 
> It seems that we can s/JOBCTL_STOPPED/JOBCTL_TRACED/ instead, and kill the
> nasty wait_on_bit(JOBCTL_TRAPPING_BIT) along with JOBCTL_TRAPPING_BIT. Sure,
> this doesn't belong to this series.

I'm afraid I didn't look hard enough at that part to really understand
it, but some cleanup around there sounds lovely.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 13:24   ` Oleg Nesterov
@ 2022-04-13 16:58     ` Peter Zijlstra
  2022-04-13 18:57     ` Oleg Nesterov
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-13 16:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Wed, Apr 13, 2022 at 03:24:52PM +0200, Oleg Nesterov wrote:
> Hi Peter,
> 
> I like 1-2 but I need to read them (and other patches) again, a
> couple of nits right now.
> 
> On 04/12, Peter Zijlstra wrote:
> >
> > +static int __ptrace_freeze_cond(struct task_struct *p)
> > +{
> > +	if (!task_is_traced(p))
> > +		return -ESRCH;
> 
> 	if (!task_is_traced(p) || p->parent != current)
> 		return -ESRCH;
> 
> we should not spin/sleep if it is traced by another task

Yes, fair enough. And I suppose doing this test without holding siglock
is safe enough.

> > +static int __ptrace_freeze(struct task_struct *p, void *arg)
> > +{
> > +	int ret;
> > +
> > +	ret = __ptrace_freeze_cond(p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
> > +	 * anymore.
> > +	 */
> > +	if (*(unsigned long *)arg != p->nvcsw)
> > +		return -ESRCH;
> > +
> > +	if (looks_like_a_spurious_pid(p))
> > +		return -ESRCH;
> 
> Oh, I do not think __ptrace_freeze() should check for spurious pid...
> looks_like_a_spurious_pid() should be called once in ptrace_check_attach()
> before task_call_func(__ptrace_freeze).

I can certainly do that, but since that needs to be done with siglock held,
and the __ptrace_freeze call is a one-time affair, I didn't really see
the point in making the code more complicated.

Something like so then?

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -222,7 +222,7 @@ static void ptrace_unfreeze_traced(struc
  */
 static int __ptrace_freeze_cond(struct task_struct *p)
 {
-	if (!task_is_traced(p))
+	if (!task_is_traced(p) || p->parent != current)
 		return -ESRCH;
 
 	if (task_curr(p))
@@ -283,9 +283,6 @@ static int __ptrace_freeze(struct task_s
 	if (*(unsigned long *)arg != p->nvcsw)
 		return -ESRCH;
 
-	if (looks_like_a_spurious_pid(p))
-		return -ESRCH;
-
 	if (__fatal_signal_pending(p))
 		return -ESRCH;
 
@@ -378,6 +375,9 @@ static int ptrace_check_attach(struct ta
 		 * does ptrace_unlink() before __exit_signal().
 		 */
 		spin_lock_irq(&child->sighand->siglock);
+		if (looks_like_a_spurious_pid(child))
+			goto unlock_sig;
+
 		ret = task_call_func(child, __ptrace_freeze, &nvcsw);
 		if (ret) {
 			/*
@@ -386,6 +386,7 @@ static int ptrace_check_attach(struct ta
 			 */
 			ret = -ESRCH;
 		}
+unlock_sig:
 		spin_unlock_irq(&child->sighand->siglock);
 	}
 unlock:

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 13:24   ` Oleg Nesterov
  2022-04-13 16:58     ` Peter Zijlstra
@ 2022-04-13 18:57     ` Oleg Nesterov
  2022-04-13 18:59       ` Oleg Nesterov
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-13 18:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/13, Oleg Nesterov wrote:
>
> I like 1-2 but I need to read them (and other patches) again, a
> couple of nits right now.

Sorry, didn't have time to do this today, and now I am already sleeping.

But... on a second thought, it seems there is a better solution. If nothing
else it is simpler and doesn't duplicate the wait_task_inactive() logic.

How about the patch below instead? On top of 1/5.

Yes, yes, incomplete. In particular see the "!!!!!!!!!" comments. Just to
explain the idea.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 18:57     ` Oleg Nesterov
@ 2022-04-13 18:59       ` Oleg Nesterov
  2022-04-13 19:20         ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-13 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/13, Oleg Nesterov wrote:
>
> On 04/13, Oleg Nesterov wrote:
> >
> > I like 1-2 but I need to read them (and other patches) again, a
> > couple of nits right now.
>
> Sorry, didn't have time to do this today, and now I am already sleeping.
>
> But... on a second thought, it seems there is a better solution. If nothing
> else it is simpler and doesn't duplicate the wait_task_inactive() logic.
>
> How about the patch below instead? On top of 1/5.
>
> Yes, yes, incomplete. In particular see the "!!!!!!!!!" comments. Just to
> explain the idea.

Cough. Forgot to attach the patch, sorry for the noise.

Oleg.
---

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..86b5226e6ba2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -255,6 +255,19 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
 	int ret = -ESRCH;
 
+	if (!(child->ptrace && child->parent == current))
+		return ret;
+
+	if (ignore_state)
+		return 0;
+
+	if (wait_on_bit(&task->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
+		return -EINTR;
+	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
+	// wait_task_inactive() should succeed or fail "really soon".
+	if (!wait_task_inactive(child, __TASK_TRACED))
+		return ret;
+
 	/*
 	 * We take the read lock around doing both checks to close a
 	 * possible race where someone else was tracing our child and
@@ -269,23 +282,11 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		if (ignore_state || ptrace_freeze_traced(child))
+		if (ptrace_freeze_traced(child))
 			ret = 0;
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
-	}
-
 	return ret;
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..5ca6235e5231 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2220,7 +2220,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= (JOBCTL_TRACED | JOBCTL_TRACED_XXX);
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2291,6 +2291,10 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		preempt_disable();
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
+		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+		// wrong, needs siglock
+		current->jobctl &= ~JOBCTL_TRACED_XXX;
+		wake_up_bit(&current->jobctl, ~JOBCTL_TRACED_XXX_BIT);
 		preempt_enable_no_resched();
 		freezable_schedule();
 		cgroup_leave_frozen(true);
@@ -2308,6 +2312,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+		// need to clear ~JOBCTL_TRACED_XXX
 		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		read_code = false;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 18:59       ` Oleg Nesterov
@ 2022-04-13 19:20         ` Peter Zijlstra
  2022-04-13 19:56           ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-13 19:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:


> +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> +		// wrong, needs siglock
> +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> +		wake_up_bit(&current->jobctl, ~JOBCTL_TRACED_XXX_BIT);
		  __wake_up_common_lock()
		    spin_lock_irqsave()
		      current_save_and_set_rtlock_wait_state();

			> +	if (wait_on_bit(&task->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
			> +		return -EINTR;
			> +	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
			> +	// wait_task_inactive() should succeed or fail "really soon".
			> +	if (!wait_task_inactive(child, __TASK_TRACED))
			> +		return ret;


	*whoopsie* ?

>  		preempt_enable_no_resched();
>  		freezable_schedule();
>  		cgroup_leave_frozen(true);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 19:20         ` Peter Zijlstra
@ 2022-04-13 19:56           ` Peter Zijlstra
  2022-04-14 11:54             ` Oleg Nesterov
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-13 19:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Wed, Apr 13, 2022 at 09:20:53PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:
> 
> 
> > +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +		// wrong, needs siglock
> > +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> > +		wake_up_bit(&current->jobctl, ~JOBCTL_TRACED_XXX_BIT);
> 		  __wake_up_common_lock()
> 		    spin_lock_irqsave()
> 		      current_save_and_set_rtlock_wait_state();
> 
> 			> +	if (wait_on_bit(&task->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
> 			> +		return -EINTR;
> 			> +	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
> 			> +	// wait_task_inactive() should succeed or fail "really soon".
> 			> +	if (!wait_task_inactive(child, __TASK_TRACED))
> 			> +		return ret;
> 
> 
> 	*whoopsie* ?
> 
> >  		preempt_enable_no_resched();
> >  		freezable_schedule();
> >  		cgroup_leave_frozen(true);

Something that might work; since there's only the one ptracer, is to
instead do something like:

	current->jobctl &= ~JOBCTL_TRACED_XXX; // siglock
	if (current->ptrace)
		wake_up_process(current->parent);
	preempt_enable_no_resched();
	schedule();


vs

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!(p->jobctl & JOBCTL_TRACED_XXX))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	if (!wait_task_inactive(...))


ptrace_detach() needs some additional magic as well I think, but this
might just work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-13 19:56           ` Peter Zijlstra
@ 2022-04-14 11:54             ` Oleg Nesterov
  2022-04-14 12:08               ` Oleg Nesterov
  2022-04-14 18:34               ` Oleg Nesterov
  0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-14 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/13, Peter Zijlstra wrote:
>
> On Wed, Apr 13, 2022 at 09:20:53PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:
> >
> >
> > > +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > +		// wrong, needs siglock
> > > +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> > > +		wake_up_bit(&current->jobctl, ~JOBCTL_TRACED_XXX_BIT);
> > 		  __wake_up_common_lock()
> > 		    spin_lock_irqsave()
> > 		      current_save_and_set_rtlock_wait_state();

OOPS, thanks.

> Something that might work; since there's only the one ptracer, is to
> instead do something like:
>
> 	current->jobctl &= ~JOBCTL_TRACED_XXX; // siglock
> 	if (current->ptrace)
> 		wake_up_process(current->parent);
> 	preempt_enable_no_resched();
> 	schedule();
>
>
> vs
>
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (!(p->jobctl & JOBCTL_TRACED_XXX))
> 			break;
> 		schedule();

Yes, thanks... this is racy, see below, but probably fixable.

> ptrace_detach() needs some additional magic as well I think, but this
> might just work.

I don't think so, JOBCTL_TRACED_XXX must always be cleared in ptrace_stop()
and ptrace_detach() implies ptrace_check_attach().


Let's forget about the problem above for the moment. There is another problem
with my patch,

	if (!(child->ptrace && child->parent == current))
		return ret;

this check is racy without tasklist_lock, we can race with another task attaching
to our natural child (so that child->parent == current), ptrace_attach() sets
task->ptrace = flags first and changes child->parent after that.

In this case:

	if (ignore_state)
		return 0;

this is just wrong,

	if (wait_on_bit(&task->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
		return -EINTR;

this is fine,

	if (!wait_task_inactive(child, __TASK_TRACED))

this is not right either. wait_task_inactive() can loop "forever" doing schedule_hrtimeout()
if the actual debugger stops/resumes the tracee continuously. This is purely theoretical,
but still.

And this also means that the code above needs some changes too, we can rely on
wake_up_process(current->parent).

OK, let me think about it. Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-14 11:54             ` Oleg Nesterov
@ 2022-04-14 12:08               ` Oleg Nesterov
  2022-04-14 18:34               ` Oleg Nesterov
  1 sibling, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-14 12:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

this doesn't really matter, just for completeness:

On 04/14, Oleg Nesterov wrote:
>
> 	if (wait_on_bit(&task->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
> 		return -EINTR;
>
> this is fine,

No, this is wrong too. wake_up_bit() does exclusive wakeup.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-14 11:54             ` Oleg Nesterov
  2022-04-14 12:08               ` Oleg Nesterov
@ 2022-04-14 18:34               ` Oleg Nesterov
  2022-04-14 22:45                 ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-14 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/14, Oleg Nesterov wrote:
>
> OK, let me think about it. Thanks!

So, what do you think about the patch below?

If it can work, then 1/5 needs some changes, I think. In particular,
it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
we can avoid this flag altogether...

This is how ptrace_check_attach() looks with the patch applied:

	static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
	{
		int traced;
		/*
		 * We take the read lock around doing both checks to close a
		 * possible race where someone else attaches or detaches our
		 * natural child.
		 */
		read_lock(&tasklist_lock);
		traced = child->ptrace && child->parent == current;
		read_unlock(&tasklist_lock);

		if (!traced)
			return -ESRCH;

		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
		if (ignore_state)
			return 0;

		for (;;) {
			if (fatal_signal_pending(current))
				return -EINTR;
			set_current_state(TASK_KILLABLE);
			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
				__set_current_state(TASK_RUNNING);
				break;
			}
			schedule();
		}

		if (!wait_task_inactive(child, TASK_TRACED) ||
		    !ptrace_freeze_traced(child))
			return -ESRCH;

		return 0;
	}

Oleg.
---

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..5a03ae5cb0c0 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,20 +193,22 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
+	if (!lock_task_sighand(task, &flags))
+		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
 		task->jobctl |= JOBCTL_TRACED_FROZEN;
 		WRITE_ONCE(task->__state, __TASK_TRACED);
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
@@ -253,40 +255,39 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
-
+	int traced;
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
+	if (!traced)
+		return -ESRCH;
+
+	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+	if (ignore_state)
+		return 0;
+
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		schedule();
 	}
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..684f0a0e9c71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2182,6 +2182,14 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_xxx(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	current->jobctl &= ~JOBCTL_TRACED_XXX;
+	spin_unlock_irq(&current->sighand->siglock);
+	wake_up_state(current->parent, TASK_KILLABLE);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2220,7 +2228,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2282,6 +2290,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done && ptrace_reparented(current))
 			do_notify_parent_cldstop(current, false, why);
 
+		clear_traced_xxx();
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
@@ -2296,9 +2305,6 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		cgroup_leave_frozen(true);
 	} else {
 		/*
-		 * By the time we got the lock, our tracer went away.
-		 * Don't drop the lock yet, another tracer may come.
-		 *
 		 * If @gstop_done, the ptracer went away between group stop
 		 * completion and here.  During detach, it would have set
 		 * JOBCTL_STOP_PENDING on us and we'll re-enter
@@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		 */
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
+		clear_traced_xxx();
+		read_unlock(&tasklist_lock);
 
-		/* tasklist protects us from ptrace_freeze_traced() */
+		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		read_code = false;
 		if (clear_code)
 			exit_code = 0;
-		read_unlock(&tasklist_lock);
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-14 18:34               ` Oleg Nesterov
@ 2022-04-14 22:45                 ` Peter Zijlstra
  2022-04-15 10:16                   ` Oleg Nesterov
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-14 22:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
> On 04/14, Oleg Nesterov wrote:
> >
> > OK, let me think about it. Thanks!
> 
> So, what do you think about the patch below?

Might just work, will have to look at it again though ;-) Comments
below.

> If it can work, then 1/5 needs some changes, I think. In particular,
> it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps

That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
state, and isn't related to the freezer.
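
That is, just to spell it out (existing definitions, nothing new):

	#define TASK_TRACED	(TASK_WAKEKILL | __TASK_TRACED)

A tracee sleeping in ptrace_stop() is TASK_TRACED and thus still
killable, while ptrace_freeze_traced() demotes it to bare __TASK_TRACED.
JOBCTL_TRACED_FROZEN merely records that demotion, so whatever restores
the special state later (e.g. __set_task_special() in the freezer
rewrite above) knows whether to put TASK_WAKEKILL back.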

> we can avoid this flag altogether...
> 
> This is how ptrace_check_attach() looks with the patch applied:
> 
> 	static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> 	{
> 		int traced;
> 		/*
> 		 * We take the read lock around doing both checks to close a
> 		 * possible race where someone else attaches or detaches our
> 		 * natural child.
> 		 */
> 		read_lock(&tasklist_lock);
> 		traced = child->ptrace && child->parent == current;
> 		read_unlock(&tasklist_lock);
> 
> 		if (!traced)
> 			return -ESRCH;

The thing being that if it is our ptrace child, it won't be going away
since we're running this code and not ptrace_detach().  Right?

I failed to realize this earlier and I think that's part of why my
solution is somewhat overcomplicated.

> 		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> 		if (ignore_state)
> 			return 0;
> 
> 		for (;;) {
> 			if (fatal_signal_pending(current))
> 				return -EINTR;

What if signal_wake_up(.resume=true) happens here? In that case we miss
the pending fatal signal, and the task state isn't changed yet so we'll
happily go to sleep.

> 			set_current_state(TASK_KILLABLE);
> 			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {

  TRACED_XXX ?

> 				__set_current_state(TASK_RUNNING);
> 				break;
> 			}
> 			schedule();
> 		}

That is, shouldn't we write this like:

		for (;;) {
			set_current_state(TASK_KILLABLE);
			if (fatal_signal_pending(current)) {
				ret = -EINTR;
				break;
			}
			if (!(READ_ONCE(current->jobctl) & JOBCTL_TRACED_XXX)) {
				ret = 0;
				break;
			}
			schedule();
		}
		__set_current_state(TASK_RUNNING);
		if (ret)
			return ret;

> 		if (!wait_task_inactive(child, TASK_TRACED) ||
> 		    !ptrace_freeze_traced(child))
> 			return -ESRCH;
> 
> 		return 0;
> 	}
> 
> Oleg.
> ---
> 
> diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> index ec8b312f7506..1b5a57048e13 100644
> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -22,7 +22,8 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED_BIT	24
>  #define JOBCTL_TRACED_BIT	25
> -#define JOBCTL_TRACED_FROZEN_BIT 26
> +#define JOBCTL_TRACED_XXX_BIT	25
> +#define JOBCTL_TRACED_FROZEN_BIT 27
>  
>  #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
>  #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
> @@ -35,6 +36,7 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
>  #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
> +#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
>  #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
>  
>  #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 626f96d275c7..5a03ae5cb0c0 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -193,20 +193,22 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
>   */
>  static bool ptrace_freeze_traced(struct task_struct *task)
>  {
> +	unsigned long flags;
>  	bool ret = false;
>  
>  	/* Lockless, nobody but us can set this flag */
>  	if (task->jobctl & JOBCTL_LISTENING)
>  		return ret;
> +	if (!lock_task_sighand(task, &flags))
> +		return ret;
>  
> -	spin_lock_irq(&task->sighand->siglock);
>  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
>  	    !__fatal_signal_pending(task)) {
>  		task->jobctl |= JOBCTL_TRACED_FROZEN;
>  		WRITE_ONCE(task->__state, __TASK_TRACED);
>  		ret = true;
>  	}

I would feel much better if this were still a task_func_call()
validating !->on_rq && !->on_cpu.

> -	spin_unlock_irq(&task->sighand->siglock);
> +	unlock_task_sighand(task, &flags);
>  
>  	return ret;
>  }
> @@ -253,40 +255,39 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>   */
>  static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  {
> -	int ret = -ESRCH;
> -
> +	int traced;
>  	/*
>  	 * We take the read lock around doing both checks to close a
> -	 * possible race where someone else was tracing our child and
> -	 * detached between these two checks.  After this locked check,
> -	 * we are sure that this is our traced child and that can only
> -	 * be changed by us so it's not changing right after this.
> +	 * possible race where someone else attaches or detaches our
> +	 * natural child.
>  	 */
>  	read_lock(&tasklist_lock);
> -	if (child->ptrace && child->parent == current) {
> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> -		/*
> -		 * child->sighand can't be NULL, release_task()
> -		 * does ptrace_unlink() before __exit_signal().
> -		 */
> -		if (ignore_state || ptrace_freeze_traced(child))
> -			ret = 0;
> -	}
> +	traced = child->ptrace && child->parent == current;
>  	read_unlock(&tasklist_lock);
>  
> -	if (!ret && !ignore_state) {
> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> -			/*
> -			 * This can only happen if may_ptrace_stop() fails and
> -			 * ptrace_stop() changes ->state back to TASK_RUNNING,
> -			 * so we should not worry about leaking __TASK_TRACED.
> -			 */
> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> -			ret = -ESRCH;
> +	if (!traced)
> +		return -ESRCH;
> +
> +	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +	if (ignore_state)
> +		return 0;
> +
> +	for (;;) {
> +		if (fatal_signal_pending(current))
> +			return -EINTR;
> +		set_current_state(TASK_KILLABLE);
> +		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
> +			__set_current_state(TASK_RUNNING);
> +			break;
>  		}
> +		schedule();
>  	}
>  
> -	return ret;
> +	if (!wait_task_inactive(child, TASK_TRACED) ||
> +	    !ptrace_freeze_traced(child))
> +		return -ESRCH;
> +
> +	return 0;
>  }
>  
>  static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0aea3f0a8002..684f0a0e9c71 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2182,6 +2182,14 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> +static void clear_traced_xxx(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	current->jobctl &= ~JOBCTL_TRACED_XXX;
> +	spin_unlock_irq(&current->sighand->siglock);
> +	wake_up_state(current->parent, TASK_KILLABLE);
> +}
> +
>  /*
>   * This must be called with current->sighand->siglock held.
>   *
> @@ -2220,7 +2228,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>  	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
>  	set_special_state(TASK_TRACED);
>  
>  	/*
> @@ -2282,6 +2290,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		if (gstop_done && ptrace_reparented(current))
>  			do_notify_parent_cldstop(current, false, why);
>  
> +		clear_traced_xxx();
>  		/*
>  		 * Don't want to allow preemption here, because
>  		 * sys_ptrace() needs this task to be inactive.
> @@ -2296,9 +2305,6 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		cgroup_leave_frozen(true);
>  	} else {
>  		/*
> -		 * By the time we got the lock, our tracer went away.
> -		 * Don't drop the lock yet, another tracer may come.
> -		 *
>  		 * If @gstop_done, the ptracer went away between group stop
>  		 * completion and here.  During detach, it would have set
>  		 * JOBCTL_STOP_PENDING on us and we'll re-enter
> @@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		 */
>  		if (gstop_done)
>  			do_notify_parent_cldstop(current, false, why);
> +		clear_traced_xxx();
> +		read_unlock(&tasklist_lock);
>  
> -		/* tasklist protects us from ptrace_freeze_traced() */
> +		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */

But... TRACED_XXX has just been cleared ?!

>  		__set_current_state(TASK_RUNNING);
>  		read_code = false;
>  		if (clear_code)
>  			exit_code = 0;
> -		read_unlock(&tasklist_lock);
>  	}
>  
>  	/*
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-14 22:45                 ` Peter Zijlstra
@ 2022-04-15 10:16                   ` Oleg Nesterov
  2022-04-15 10:57                     ` Oleg Nesterov
  2022-04-15 12:00                     ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-15 10:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/15, Peter Zijlstra wrote:
>
> On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
>
> > If it can work, then 1/5 needs some changes, I think. In particular,
> > it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
>
> That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
> state, and isn't related to the freezer.

Lets forget about 3-5 which I didn't read carefully yet. So why do we
need TRACED_FROZEN?

From 1/5:

	 static inline void signal_wake_up(struct task_struct *t, bool resume)
	 {
	+	lockdep_assert_held(&t->sighand->siglock);
	+
	+	if (resume && !(t->jobctl & JOBCTL_TRACED_FROZEN))
	+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
	+
		signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
	 }
	+
	 static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
	 {
	+	lockdep_assert_held(&t->sighand->siglock);
	+
	+	if (resume)
	+		t->jobctl &= ~JOBCTL_TRACED;
	+
		signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
	 }

Can't we simply change signal_wake_up_state(),

	void signal_wake_up_state(struct task_struct *t, unsigned int state)
	{
		set_tsk_thread_flag(t, TIF_SIGPENDING);
		/*
		 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
		 * case. We don't check t->state here because there is a race with it
		 * executing another processor and just now entering stopped state.
		 * By using wake_up_state, we ensure the process will wake up and
		 * handle its death signal.
		 */
		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
		else
			kick_process(t);
	}

?

> > 		/*
> > 		 * We take the read lock around doing both checks to close a
> > 		 * possible race where someone else attaches or detaches our
> > 		 * natural child.
> > 		 */
> > 		read_lock(&tasklist_lock);
> > 		traced = child->ptrace && child->parent == current;
> > 		read_unlock(&tasklist_lock);
> >
> > 		if (!traced)
> > 			return -ESRCH;
>
> The thing being, that if it is our ptrace child, it won't be going away
> since we're running this code and not ptrace_detach().  Right?

Yes. and nobody else can detach it.

Another tracer can't attach until child->ptrace is cleared, but this can
only happen if a) this child is killed and b) another thread does wait()
and reaps it; but after that attach() is obviously impossible.

But since this child can go away, the patch changes ptrace_freeze_traced()
to use lock_task_sighand().

> > 		for (;;) {
> > 			if (fatal_signal_pending(current))
> > 				return -EINTR;
>
> What if signal_wake_up(.resume=true) happens here? In that case we miss
> the fatal pending, and task state isn't changed yet so we'll happily go
> sleep.

No, it won't sleep, see the signal_pending_state() check in schedule().

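To spell it out, that path is roughly this (simplified, quoting from memory,
so take it as a sketch):

	/* __schedule(), before deactivating the task: */
	if (signal_pending_state(prev_state, prev)) {
		WRITE_ONCE(prev->__state, TASK_RUNNING);
	} else {
		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
	}

	static inline int signal_pending_state(unsigned int state, struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p))
			return 0;

		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
	}

so a TASK_KILLABLE sleeper with a fatal signal pending is flipped back to
TASK_RUNNING instead of being deactivated.
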
> > 			set_current_state(TASK_KILLABLE);

And let me explain TASK_KILLABLE just in case... We could just use
TASK_UNINTERRUPTIBLE and avoid the signal_pending() check, but KILLABLE
looks "safer" to me. If the tracer hangs because of some bug, at least
it can be killed from userspace.
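
For reference, TASK_KILLABLE is just

	#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

so only a fatal (SIGKILL-style) wakeup gets through; ordinary signals don't.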


> > 			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
>
>   TRACED_XXX ?

oops ;)

> > -	spin_lock_irq(&task->sighand->siglock);
> >  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
> >  	    !__fatal_signal_pending(task)) {
> >  		task->jobctl |= JOBCTL_TRACED_FROZEN;
> >  		WRITE_ONCE(task->__state, __TASK_TRACED);
> >  		ret = true;
> >  	}
>
> I would feel much better if this were still a task_func_call()
> validating !->on_rq && !->on_cpu.

Well, but "on_rq || on_cpu" would mean that wait_task_inactive() is buggy ?

But! I forgot to make another change in this code. I do not think it should
rely on task_is_traced(). We are going to abuse task->__state, so I think
it should check task->__state == TASK_TRACED directly. Say,

	if (READ_ONCE(task->__state) == TASK_TRACED && ...) {
		WRITE_ONCE(task->__state, __TASK_TRACED);
		WARN_ON_ONCE(!task_is_traced(task));
		ret = true;
	}

looks more clean to me. What do you think?

> > @@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
> >  		 */
> >  		if (gstop_done)
> >  			do_notify_parent_cldstop(current, false, why);
> > +		clear_traced_xxx();
> > +		read_unlock(&tasklist_lock);
> >
> > -		/* tasklist protects us from ptrace_freeze_traced() */
> > +		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */
>
> But... TRACED_XXX has just been cleared ?!

Cough ;) OK, I'll move __set_current_state() back under tasklist.

And in this case we do not need wake_up(parent), so we can shift it from
clear_traced_xxx() into another branch.

OK, so far it seems that this patch needs a couple of simple fixes you
pointed out, but before I send V2:

	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?

	- will you agree if I change ptrace_freeze_traced() to rely
	  on __state == TASK_TRACED rather than task_is_traced() ?

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 10:16                   ` Oleg Nesterov
@ 2022-04-15 10:57                     ` Oleg Nesterov
  2022-04-15 12:01                       ` Peter Zijlstra
  2022-04-20 10:20                       ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
  2022-04-15 12:00                     ` Peter Zijlstra
  1 sibling, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-15 10:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/15, Oleg Nesterov wrote:
>
> OK, so far it seems that this patch needs a couple of simple fixes you
> pointed out, but before I send V2:
>
> 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
>
> 	- will you agree if I change ptrace_freeze_traced() to rely
> 	  on __state == TASK_TRACED rather than task_is_traced() ?
>

Forgot to say, I think 1/5 needs some changes in any case...

ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
too. Perhaps I missed something, I'll reread 1/5 again, but the main
question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 10:16                   ` Oleg Nesterov
  2022-04-15 10:57                     ` Oleg Nesterov
@ 2022-04-15 12:00                     ` Peter Zijlstra
  2022-04-15 12:56                       ` Oleg Nesterov
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-15 12:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Fri, Apr 15, 2022 at 12:16:44PM +0200, Oleg Nesterov wrote:
> On 04/15, Peter Zijlstra wrote:
> >
> > On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
> >
> > > If it can work, then 1/5 needs some changes, I think. In particular,
> > > it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
> >
> > That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
> > state, and isn't related to the freezer.
> 
> Lets forget about 3-5 which I didn't read carefully yet. So why do we
> need TRACED_FROZEN?

The purpose of 1/5 was to not have any unique state in __state. To at
all times be able to reconstruct __state from outside information (where
needed).

Agreed that this particular piece of state isn't needed until 5/5, but
the concept is independent (also 5/5 is insanely large already).

> From 1/5:
> 
> 	 static inline void signal_wake_up(struct task_struct *t, bool resume)
> 	 {
> 	+	lockdep_assert_held(&t->sighand->siglock);
> 	+
> 	+	if (resume && !(t->jobctl & JOBCTL_TRACED_FROZEN))
> 	+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> 	+
> 		signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> 	 }
> 	+
> 	 static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
> 	 {
> 	+	lockdep_assert_held(&t->sighand->siglock);
> 	+
> 	+	if (resume)
> 	+		t->jobctl &= ~JOBCTL_TRACED;
> 	+
> 		signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
> 	 }
> 
> Can't we simply change signal_wake_up_state(),
> 
> 	void signal_wake_up_state(struct task_struct *t, unsigned int state)
> 	{
> 		set_tsk_thread_flag(t, TIF_SIGPENDING);
> 		/*
> 		 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> 		 * case. We don't check t->state here because there is a race with it
> 		 * executing another processor and just now entering stopped state.
> 		 * By using wake_up_state, we ensure the process will wake up and
> 		 * handle its death signal.
> 		 */
> 		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
> 			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> 		else
> 			kick_process(t);
> 	}
> 
> ?

This would be broken when we do signal_wake_up_state() when state
doesn't match. Does that happen? I'm thinking siglock protects us from
the most obvious races, but still.

If not broken, then it needs at least a comment explaining why not etc..
I'm sure to not remember many of these details.

Also, signal_wake_up_state() really can do with that
lockdep_assert_held() as well ;-)

> > > 		/*
> > > 		 * We take the read lock around doing both checks to close a
> > > 		 * possible race where someone else attaches or detaches our
> > > 		 * natural child.
> > > 		 */
> > > 		read_lock(&tasklist_lock);
> > > 		traced = child->ptrace && child->parent == current;
> > > 		read_unlock(&tasklist_lock);
> > >
> > > 		if (!traced)
> > > 			return -ESRCH;
> >
> > The thing being, that if it is our ptrace child, it won't be going away
> > since we're running this code and not ptrace_detach().  Right?
> 
> Yes. and nobody else can detach it.
> 
> Another tracer can't attach until child->ptrace is cleared, but this can
> only happen if a) this child is killed and b) another thread does wait()
> and reaps it; but after that attach() is obviously impossible.
> 
> But since this child can go away, the patch changes ptrace_freeze_traced()
> to use lock_task_sighand().

Right.

> > > 		for (;;) {
> > > 			if (fatal_signal_pending(current))
> > > 				return -EINTR;
> >
> > What if signal_wake_up(.resume=true) happens here? In that case we miss
> > the fatal pending, and task state isn't changed yet so we'll happily go
> > sleep.
> 
> No, it won't sleep, see the signal_pending_state() check in schedule().

Urgh, forgot about that one ;-)

> > > 			set_current_state(TASK_KILLABLE);
> 
> And let me explain TASK_KILLABLE just in case... We could just use
> TASK_UNINTERRUPTIBLE and avoid the signal_pending() check, but KILLABLE
> looks "safer" to me. If the tracer hangs because of some bug, at least
> it can be killed from userspace.

Agreed.

> 
> > > 			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
> >
> >   TRACED_XXX ?
> 
> oops ;)
> 
> > > -	spin_lock_irq(&task->sighand->siglock);
> > >  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
> > >  	    !__fatal_signal_pending(task)) {
> > >  		task->jobctl |= JOBCTL_TRACED_FROZEN;
> > >  		WRITE_ONCE(task->__state, __TASK_TRACED);
> > >  		ret = true;
> > >  	}
> >
> > I would feel much better if this were still a task_func_call()
> > validating !->on_rq && !->on_cpu.
> 
> Well, but "on_rq || on_cpu" would mean that wait_task_inactive() is buggy ?

Yes, but I'm starting to feel a little paranoid here. Better safe than
sorry etc..

> But! I forgot to make another change in this code. I do not think it should
> rely on task_is_traced(). We are going to abuse task->__state, so I think
> it should check task->__state == TASK_TRACED directly. Say,
> 
> 	if (READ_ONCE(task->__state) == TASK_TRACED && ...) {
> 		WRITE_ONCE(task->__state, __TASK_TRACED);
> 		WARN_ON_ONCE(!task_is_traced(task));
> 		ret = true;
> 	}
> 
> looks more clean to me. What do you think?

Agreed on this.

> > > @@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
> > >  		 */
> > >  		if (gstop_done)
> > >  			do_notify_parent_cldstop(current, false, why);
> > > +		clear_traced_xxx();
> > > +		read_unlock(&tasklist_lock);
> > >
> > > -		/* tasklist protects us from ptrace_freeze_traced() */
> > > +		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */
> >
> > But... TRACED_XXX has just been cleared ?!
> 
> Cough ;) OK, I'll move __set_current_state() back under tasklist.
> 
> And in this case we do not need wake_up(parent), so we can shift it from
> clear_traced_xxx() into another branch.
> 
> OK, so far it seems that this patch needs a couple of simple fixes you
> pointed out, but before I send V2:
> 
> 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?

We can for the sake of 2 avoid TRACED_FROZEN, but as explained at the
start, the point of 1 was to ensure there is no unique state in __state,
and I think in that respect we can keep it, hmm?

> 	- will you agree if I change ptrace_freeze_traced() to rely
> 	  on __state == TASK_TRACED rather than task_is_traced() ?

Yes.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 10:57                     ` Oleg Nesterov
@ 2022-04-15 12:01                       ` Peter Zijlstra
  2022-04-18 17:01                         ` Oleg Nesterov
  2022-04-20 10:20                       ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-15 12:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> On 04/15, Oleg Nesterov wrote:
> >
> > OK, so far it seems that this patch needs a couple of simple fixes you
> > pointed out, but before I send V2:
> >
> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> >
> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> >
> 
> Forgot to say, I think 1/5 needs some changes in any case...
> 
> ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> too.

Urgh, yes, I seemed to have missed that one :-(


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 12:00                     ` Peter Zijlstra
@ 2022-04-15 12:56                       ` Oleg Nesterov
  0 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-15 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/15, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:16:44PM +0200, Oleg Nesterov wrote:
> >
> > Lets forget about 3-5 which I didn't read carefully yet. So why do we
> > need TRACED_FROZEN?
>
> The purpose of 1/5 was to not have any unique state in __state. To at
> all times be able to reconstruct __state from outside information (where
> needed).
>
> Agreed that this particular piece of state isn't needed until 5/5, but
> the concept is independent (also 5/5 is insanely large already).

OK, so in my opinion it would be more clean if TRACED_FROZEN comes in a
separate (and simple) patch before 5/5.

I won't really argue, but to me this flag looks confusing and unnecessary
in 1-2 (which btw look like a -stable material to me).

> > Can't we simply change signal_wake_up_state(),
> >
> > 	void signal_wake_up_state(struct task_struct *t, unsigned int state)
> > 	{
> > 		set_tsk_thread_flag(t, TIF_SIGPENDING);
> > 		/*
> > 		 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> > 		 * case. We don't check t->state here because there is a race with it
> > 		 * executing another processor and just now entering stopped state.
> > 		 * By using wake_up_state, we ensure the process will wake up and
> > 		 * handle its death signal.
> > 		 */
> > 		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
> > 			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> > 		else
> > 			kick_process(t);
> > 	}
> >
> > ?
>
> This would be broken when we do signal_wake_up_state() when state
> doesn't match. Does that happen? I'm thinking siglock protects us from
> the most obvious races, but still.

Yes, even set_tsk_thread_flag(TIF_SIGPENDING) is not safe without siglock.

> Also, signal_wake_up_state() really can do with that
> lockdep_assert_held() as well ;-)

OK, lets add lockdep_assert_held() at the start of signal_wake_up_state ?

Agreed, this probably needs a comment, but this looks much simpler and
more understandable than 2 additional "if (resume)" checks in
signal_wake_up() and ptrace_signal_wake_up().

> > > > -	spin_lock_irq(&task->sighand->siglock);
> > > >  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
> > > >  	    !__fatal_signal_pending(task)) {
> > > >  		task->jobctl |= JOBCTL_TRACED_FROZEN;
> > > >  		WRITE_ONCE(task->__state, __TASK_TRACED);
> > > >  		ret = true;
> > > >  	}
> > >
> > > I would feel much better if this were still a task_func_call()
> > > validating !->on_rq && !->on_cpu.
> >
> > Well, but "on_rq || on_cpu" would mean that wait_task_inactive() is buggy ?
>
> Yes, but I'm starting to feel a little paranoid here. Better safe than
> sorry etc..

OK, can we simply add

	WARN_ON_ONCE(ret && (on_rq || on_cpu));

after unlock_task_sighand() ? this is racy without rq_lock but should
catch the possible problems.
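
IOW, at the end of ptrace_freeze_traced(), something like this (sketch only):

	unlock_task_sighand(task, &flags);
	/* racy without rq lock, but good enough to catch a leak */
	WARN_ON_ONCE(ret && (task->on_rq || task->on_cpu));

	return ret;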

> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
>
> We can for the sake of 2 avoid TRACED_FROZEN, but as explained at the
> start, the point of 1 was to ensure there is no unique state in __state,
> and I think in that respect we can keep it, hmm?

See above... I understand the purpose of TRACED_FROZEN (I hope ;),
but not in 1-2, and unless I missed something the change in signal_wake_up
above simply looks better to me, but of course this is subjective.

> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
>
> Yes.

Great, thanks. I'll return tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 12:01                       ` Peter Zijlstra
@ 2022-04-18 17:01                         ` Oleg Nesterov
  2022-04-18 17:19                           ` Oleg Nesterov
  2022-04-20 13:17                           ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-18 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/15, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> > On 04/15, Oleg Nesterov wrote:
> > >
> > > OK, so far it seems that this patch needs a couple of simple fixes you
> > > pointed out, but before I send V2:
> > >
> > > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> > >
> > > 	- will you agree if I change ptrace_freeze_traced() to rely
> > > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> > >
> >
> > Forgot to say, I think 1/5 needs some changes in any case...
> >
> > ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> > clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> > too.
>
> Urgh, yes, I seemed to have missed that one :-(

OK, I'll wait for updated version and send V3 on top of it.

I think the fixes are simple. ptrace_resume() should simply remove the
minor "need_siglock" optimization and clear JOBCTL_TRACED. As for the
"else" branch in ptrace_stop(), perhaps 1/5 can introduce the trivial

	static void clear_traced_jobctl_flags(unsigned long flags)
	{
		spin_lock_irq(&current->sighand->siglock);
		current->jobctl &= ~flags;
		spin_unlock_irq(&current->sighand->siglock);
	}

which can be reused by 2/5 to clear JOBCTL_TRACED_XXX. And, Peter, feel
free to join 1/5 and this patch if you think this makes any sense.
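
For the ptrace_resume() side I mean something like this (sketch only, not
the actual patch):

	/* ptrace_resume(), with the need_siglock optimization removed: */
	spin_lock_irq(&child->sighand->siglock);
	child->exit_code = data;
	child->jobctl &= ~JOBCTL_TRACED;
	wake_up_state(child, __TASK_TRACED);
	spin_unlock_irq(&child->sighand->siglock);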

Meanwhile see V2 on top of the current version.

Oleg.


diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..ec104bae4e21 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,20 +193,24 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
+	if (!lock_task_sighand(task, &flags))
+		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (READ_ONCE(task->__state) == TASK_TRACED &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
 		task->jobctl |= JOBCTL_TRACED_FROZEN;
 		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(!task_is_traced(task));
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
@@ -253,40 +257,42 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
-
+	int traced;
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
+	if (!traced)
+		return -ESRCH;
+
+	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_XXX)) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		schedule();
 	}
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..c7a89904cc4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2182,6 +2182,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_xxx(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	current->jobctl &= ~JOBCTL_TRACED_XXX;
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2220,7 +2227,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2282,6 +2289,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done && ptrace_reparented(current))
 			do_notify_parent_cldstop(current, false, why);
 
+		clear_traced_xxx();
+		wake_up_state(current->parent, TASK_KILLABLE);
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
@@ -2297,8 +2306,12 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	} else {
 		/*
 		 * By the time we got the lock, our tracer went away.
-		 * Don't drop the lock yet, another tracer may come.
-		 *
+		 * Don't drop the lock yet, another tracer may come,
+		 * tasklist protects us from ptrace_freeze_traced().
+		 */
+		__set_current_state(TASK_RUNNING);
+		clear_traced_xxx();
+		/*
 		 * If @gstop_done, the ptracer went away between group stop
 		 * completion and here.  During detach, it would have set
 		 * JOBCTL_STOP_PENDING on us and we'll re-enter
@@ -2307,13 +2320,11 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		 */
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
+		read_unlock(&tasklist_lock);
 
-		/* tasklist protects us from ptrace_freeze_traced() */
-		__set_current_state(TASK_RUNNING);
 		read_code = false;
 		if (clear_code)
 			exit_code = 0;
-		read_unlock(&tasklist_lock);
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-18 17:01                         ` Oleg Nesterov
@ 2022-04-18 17:19                           ` Oleg Nesterov
  2022-04-20 13:17                           ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-18 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/18, Oleg Nesterov wrote:
>
>  static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  {
...
> +	if (!traced)
> +		return -ESRCH;
> +
> +	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +	if (ignore_state)
> +		return 0;
> +
> +	if (!task_is_traced(child))
> +		return -ESRCH;

This is the new check V1 didn't have; we need it to ensure that check_attach
can't miss the change in child->jobctl and call wait_task_inactive() before
the child marks itself as "traced" and clears JOBCTL_TRACED_XXX.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-15 10:57                     ` Oleg Nesterov
  2022-04-15 12:01                       ` Peter Zijlstra
@ 2022-04-20 10:20                       ` Peter Zijlstra
  2022-04-20 11:35                         ` Oleg Nesterov
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-20 10:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> On 04/15, Oleg Nesterov wrote:
> >
> > OK, so far it seems that this patch needs a couple of simple fixes you
> > pointed out, but before I send V2:
> >
> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> >
> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> >
> 
> Forgot to say, I think 1/5 needs some changes in any case...
> 
> ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> too. Perhaps I missed something, I'll reread 1/5 again, but the main
> question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.

Ok, getting back to this. So I did the change to ptrace_resume(), but
I'm not entirely sure I understand the issue with the else branch of
ptrace_stop().

My understanding is that if we hit that else branch, we've raced with
__ptrace_unlink(), and that will have done:

  if (... || task_is_traced(child))
	ptrace_signal_wake_up(child, true);

Which will have done that wakeup and cleared both __state and jobctl.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-20 10:20                       ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
@ 2022-04-20 11:35                         ` Oleg Nesterov
  0 siblings, 0 replies; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-20 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/20, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> >
> > ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> > clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> > too. Perhaps I missed something, I'll reread 1/5 again, but the main
> > question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.
>
> Ok, getting back to this. So I did the change to ptrace_resume(), but
> I'm not entirely sure I understand the issue with the else branch of
> ptrace_stop().
>
> My understanding is that if we hit that else branch, we've raced with
> __ptrace_unlink(), and that will have done:

Yes, thanks for correcting me, I forgot that may_ptrace_stop() has gone.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-18 17:01                         ` Oleg Nesterov
  2022-04-18 17:19                           ` Oleg Nesterov
@ 2022-04-20 13:17                           ` Peter Zijlstra
  2022-04-20 18:03                             ` Oleg Nesterov
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-20 13:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Mon, Apr 18, 2022 at 07:01:05PM +0200, Oleg Nesterov wrote:

> diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> index ec8b312f7506..1b5a57048e13 100644
> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -22,7 +22,8 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED_BIT	24
>  #define JOBCTL_TRACED_BIT	25
> +#define JOBCTL_TRACED_XXX_BIT	25

26; also we must come up with a better name than triple-X. In my head
it's started to be called TRACED_OLEG, but that can't be right either
;-)

Does something like:

#define JOBCTL_TRACED_BIT		25
#define JOBCTL_TRACED_QUIESCE_BIT	26

work?
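
That is, something like (sketch; TRACED_FROZEN then moves up one bit):

	#define JOBCTL_STOPPED_BIT		24
	#define JOBCTL_TRACED_BIT		25
	#define JOBCTL_TRACED_QUIESCE_BIT	26
	#define JOBCTL_TRACED_FROZEN_BIT	27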

> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0aea3f0a8002..c7a89904cc4a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2182,6 +2182,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> +static void clear_traced_xxx(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	current->jobctl &= ~JOBCTL_TRACED_XXX;
> +	spin_unlock_irq(&current->sighand->siglock);
> +}
> +
>  /*
>   * This must be called with current->sighand->siglock held.
>   *
> @@ -2220,7 +2227,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>  	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
>  	set_special_state(TASK_TRACED);
>  
>  	/*
> @@ -2282,6 +2289,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		if (gstop_done && ptrace_reparented(current))
>  			do_notify_parent_cldstop(current, false, why);
>  
> +		clear_traced_xxx();
> +		wake_up_state(current->parent, TASK_KILLABLE);
>  		/*
>  		 * Don't want to allow preemption here, because
>  		 * sys_ptrace() needs this task to be inactive.
> @@ -2297,8 +2306,12 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	} else {
>  		/*
>  		 * By the time we got the lock, our tracer went away.
> -		 * Don't drop the lock yet, another tracer may come.
> -		 *
> +		 * Don't drop the lock yet, another tracer may come,
> +		 * tasklist protects us from ptrace_freeze_traced().
> +		 */
> +		__set_current_state(TASK_RUNNING);
> +		clear_traced_xxx();
> +		/*
>  		 * If @gstop_done, the ptracer went away between group stop
>  		 * completion and here.  During detach, it would have set
>  		 * JOBCTL_STOP_PENDING on us and we'll re-enter

This is that same else clause again; perhaps make signal_wake_up_state()
also clear TRACED_XXX instead?
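
Something like so, perhaps (completely untested sketch, folding in your
earlier signal_wake_up_state() proposal):

	void signal_wake_up_state(struct task_struct *t, unsigned int state)
	{
		lockdep_assert_held(&t->sighand->siglock);

		set_tsk_thread_flag(t, TIF_SIGPENDING);
		/*
		 * Clear the stop/traced bits, including TRACED_XXX, only when
		 * the wakeup actually took effect.
		 */
		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_XXX);
		else
			kick_process(t);
	}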

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT
  2022-04-20 13:17                           ` Peter Zijlstra
@ 2022-04-20 18:03                             ` Oleg Nesterov
  2022-04-20 20:54                               ` [RFC][PATCH] ptrace: Don't change __state Eric W. Biederman
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-20 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, mingo, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On 04/20, Peter Zijlstra wrote:
>
> Does something like:
>
> #define JOBCTL_TRACED_BIT		25
> #define JOBCTL_TRACED_QUIESCE_BIT	26
>
> work?

Agreed! ;)

> >  	} else {
> >  		/*
> >  		 * By the time we got the lock, our tracer went away.
> > -		 * Don't drop the lock yet, another tracer may come.
> > -		 *
> > +		 * Don't drop the lock yet, another tracer may come,
> > +		 * tasklist protects us from ptrace_freeze_traced().
> > +		 */
> > +		__set_current_state(TASK_RUNNING);
> > +		clear_traced_xxx();
> > +		/*
> >  		 * If @gstop_done, the ptracer went away between group stop
> >  		 * completion and here.  During detach, it would have set
> >  		 * JOBCTL_STOP_PENDING on us and we'll re-enter
>
> This is that same else clause again; perhaps make signal_wake_up_state()
> also clear TRACED_XXX instead?

I swear, I too thought about this after my last email. Yes, I think this
makes sense. Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC][PATCH] ptrace: Don't change __state
  2022-04-20 18:03                             ` Oleg Nesterov
@ 2022-04-20 20:54                               ` Eric W. Biederman
  2022-04-21  7:21                                 ` Peter Zijlstra
  2022-04-21  9:46                                 ` Oleg Nesterov
  0 siblings, 2 replies; 37+ messages in thread
From: Eric W. Biederman @ 2022-04-20 20:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm


I was thinking about this and I have an approach from a different
direction.  In particular it removes the need for ptrace_freeze_attach
and ptrace_unfreeze_attach to change __state.  Instead a jobctl
bit is used to suppress waking up a process with TASK_WAKEKILL.

I think this would be a good technique to completely decouple
PREEMPT_RT from the work that ptrace_freeze_attach does.

Comments?

Eric

---

ptrace_freeze_traced() and ptrace_unfreeze_traced() have been playing games
with __state, removing TASK_WAKEKILL from __state while a ptrace command is
executing.

Instead add JOBCTL_DELAY_WAKEKILL, set it in ptrace_freeze_traced(), and
clear it in ptrace_unfreeze_traced() and on the final exit of ptrace_stop().

Then in signal_wake_up_state drop TASK_WAKEKILL if it is sent while
JOBCTL_DELAY_WAKEKILL is set.

As signal_wake_up is the only function that sets TASK_WAKEKILL when
waking a process, this achieves the same effect with less messing
with state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

As a proof of concept this seems to work most of the time; unfortunately
I have a bug somewhere (a memory stomp?) and I am hitting a BUG_ON that
asserts __send_signal is not holding the siglock.

[   97.000168] ------------[ cut here ]------------
[   97.001782] kernel BUG at kernel/signal.c:1086!
[   97.002539] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   97.002846] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.18.0-rc1userns+ #449
[   97.003135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   97.003480] RIP: 0010:__send_signal+0x256/0x580
[   97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40
[   97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046
[   97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001
[   97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e
[   97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000
[   97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3
[   97.003812] FS:  0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000
[   97.003812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0
[   97.003812] Call Trace:
[   97.003812]  <IRQ>
[   97.003812]  group_send_sig_info+0xe1/0x190
[   97.003812]  kill_pid_info+0x78/0x150
[   97.003812]  it_real_fn+0x35/0xe0
[   97.003812]  ? __x64_sys_setitimer+0x150/0x150
[   97.003812]  __hrtimer_run_queues+0x1ca/0x3f0
[   97.003812]  hrtimer_interrupt+0x106/0x220
[   97.003812]  __sysvec_apic_timer_interrupt+0x96/0x260
[   97.003812]  sysvec_apic_timer_interrupt+0x9d/0xd0
[   97.003812]  </IRQ>
[   97.003812]  <TASK>
[   97.003812]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   97.003812] RIP: 0010:default_idle+0x10/0x20
[   97.003812] Code: 8b 04 25 00 b0 01 00 f0 80 60 02 df c3 0f ae f0 0f ae 38 0f ae f0 eb b9 66 90 0f 1f 44 00 00 eb 07 03
[   97.003812] RSP: 0018:ffffc90000093ef8 EFLAGS: 00000246
[   97.003812] RAX: ffffffff823fafb0 RBX: ffff8880056f97c0 RCX: 0000000000000000
[   97.003812] RDX: ffffffff81193e1d RSI: ffffffff82da5301 RDI: ffffffff823fb111
[   97.003812] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[   97.003812] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   97.003812]  ? mwait_idle+0x70/0x70
[   97.003812]  ? nohz_balance_enter_idle+0x6d/0x220
[   97.003812]  ? default_idle_call+0x21/0x90
[   97.003812]  ? mwait_idle+0x70/0x70
[   97.003812]  default_idle_call+0x59/0x90
[   97.003812]  do_idle+0x1f3/0x260
[   97.003812]  ? finish_task_switch.isra.0+0xef/0x350
[   97.003812]  cpu_startup_entry+0x19/0x20
[   97.003812]  secondary_startup_64_no_verify+0xc3/0xcb
[   97.003812]  </TASK>
[   97.003812] Modules linked in:
[   97.003812] ---[ end trace 0000000000000000 ]---
[   97.003812] RIP: 0010:__send_signal+0x256/0x580
[   97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40
[   97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046
[   97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001
[   97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e
[   97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000
[   97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3
[   97.003812] FS:  0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000
[   97.003812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0
[   97.003812] Kernel panic - not syncing: Fatal exception in interrupt
[   97.003812] Kernel Offset: disabled
[   97.003812] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
---
 include/linux/sched/jobctl.h |  2 ++
 kernel/ptrace.c              | 32 ++++++++++++++------------------
 kernel/signal.c              |  5 +++++
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..4e154ad8205f 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT	24	/* delay killable wakeups */
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 4f78d0d5940c..a55320a0e8e3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -197,7 +197,8 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		//WARN_ON((task->jobctl & JOBCTL_DELAY_WAKEKILL));
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
 	spin_unlock_irq(&task->sighand->siglock);
@@ -207,7 +208,7 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
@@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
 	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
 	 * Recheck state under the lock to close this race.
 	 */
-	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task))
-			wake_up_state(task, __TASK_TRACED);
-		else
-			WRITE_ONCE(task->__state, TASK_TRACED);
-	}
+	spin_unlock_irq(&task->sighand->siglock);
+	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
+	if (fatal_signal_pending(task))
+		wake_up_state(task, TASK_WAKEKILL);
 	spin_unlock_irq(&task->sighand->siglock);
 }
 
@@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 */
 	read_lock(&tasklist_lock);
 	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
@@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	read_unlock(&tasklist_lock);
 
 	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
+		if (!wait_task_inactive(child, TASK_TRACED)) {
 			/*
 			 * This can only happen if may_ptrace_stop() fails and
 			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
+			 * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL.
 			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+			WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL));
 			ret = -ESRCH;
 		}
 	}
@@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
 	 * status and clears the code too; this can't race with the tracee, it
 	 * takes siglock after resume.
 	 */
-	need_siglock = data && !thread_group_empty(current);
 	if (need_siglock)
 		spin_lock_irq(&child->sighand->siglock);
 	child->exit_code = data;
@@ -1325,8 +1323,7 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1468,8 +1465,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_long_t, request, compat_long_t, pid,
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
  out_put_task_struct:
diff --git a/kernel/signal.c b/kernel/signal.c
index 30cd1ca43bcd..a1277580c92e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -762,6 +762,10 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
@@ -2328,6 +2332,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-20 20:54                               ` [RFC][PATCH] ptrace: Don't change __state Eric W. Biederman
@ 2022-04-21  7:21                                 ` Peter Zijlstra
  2022-04-21 10:26                                   ` Peter Zijlstra
  2022-04-21 14:45                                   ` Eric W. Biederman
  2022-04-21  9:46                                 ` Oleg Nesterov
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-21  7:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
> 
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_attach
> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.
> 
> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_attach does.
> 
> Comments?

On first read-through, I like it! A few comments down below..

> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);

  ^^^^ this should be spin_lock_irq(...)

> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
> +	if (fatal_signal_pending(task))
> +		wake_up_state(task, TASK_WAKEKILL);
>  	spin_unlock_irq(&task->sighand->siglock);
>  }
>  
> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	 */
>  	read_lock(&tasklist_lock);
>  	if (child->ptrace && child->parent == current) {
> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>  		/*
>  		 * child->sighand can't be NULL, release_task()
>  		 * does ptrace_unlink() before __exit_signal().
> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	read_unlock(&tasklist_lock);
>  
>  	if (!ret && !ignore_state) {
> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> +		if (!wait_task_inactive(child, TASK_TRACED)) {

This is still very dubious, there are spinlocks between
set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
can fail where we don't want it to due to TASK_TRACED being temporarily
held in ->saved_state.
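
To recap: on PREEMPT_RT those spinlocks are sleeping locks, and the rtlock
slowpath parks the task state while it blocks, roughly (simplified sketch):

	current_save_and_set_rtlock_wait_state();	/* __state = TASK_RTLOCK_WAIT,
							 * TASK_TRACED goes to ->saved_state */
	/* ... block on the rtmutex ... */
	current_restore_rtlock_saved_state();		/* __state restored from ->saved_state */

so wait_task_inactive(child, TASK_TRACED) can observe TASK_RTLOCK_WAIT in
__state and fail even though the child is on its way to a proper TRACED stop.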

>  			/*
>  			 * This can only happen if may_ptrace_stop() fails and
>  			 * ptrace_stop() changes ->state back to TASK_RUNNING,
> -			 * so we should not worry about leaking __TASK_TRACED.
> +			 * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL.
>  			 */
> +			WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL));
>  			ret = -ESRCH;
>  		}
>  	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-20 20:54                               ` [RFC][PATCH] ptrace: Don't change __state Eric W. Biederman
  2022-04-21  7:21                                 ` Peter Zijlstra
@ 2022-04-21  9:46                                 ` Oleg Nesterov
  2022-04-21 15:01                                   ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-21  9:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

On 04/20, Eric W. Biederman wrote:
>
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_attach
> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.

I think this can work, but we still need something like 1/5 + 2/5?

> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_attach does.

If CONFIG_PREEMPT_RT=y we can't rely on the ->__state check in task_is_traced(),
and wait_task_inactive() can wrongly fail if the tracee sleeps waiting
for tasklist_lock.

A couple of comments after a quick glance,

>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> +	if (!task_is_traced(task))
>  		return;
>
>  	WARN_ON(!task->ptrace || task->parent != current);
> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);
> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

We can't rely on the lockless task_is_traced() check above... probably this
is fine, but I need to re-check. But at least you need to remove the comment
about PTRACE_LISTEN above.

Another problem is that WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL))
doesn't look right if ignore_state in ptrace_check_attach() was true; the
tracee could stop before ptrace_unfreeze_traced().

> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>  	 * status and clears the code too; this can't race with the tracee, it
>  	 * takes siglock after resume.
>  	 */
> -	need_siglock = data && !thread_group_empty(current);
>  	if (need_siglock)
>  		spin_lock_irq(&child->sighand->siglock);

Hmm?

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-21  7:21                                 ` Peter Zijlstra
@ 2022-04-21 10:26                                   ` Peter Zijlstra
  2022-04-21 10:49                                     ` Oleg Nesterov
  2022-04-21 14:45                                   ` Eric W. Biederman
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-21 10:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

On Thu, Apr 21, 2022 at 09:21:38AM +0200, Peter Zijlstra wrote:
> > @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> >  	read_unlock(&tasklist_lock);
> >  
> >  	if (!ret && !ignore_state) {
> > -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> > +		if (!wait_task_inactive(child, TASK_TRACED)) {
> 
> This is still very dubious, there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to due to TASK_TRACED being temporarily
> held in ->saved_state.

As such, I've taken the liberty of munging yours and Oleg's approach
together. I've yet to actually test this but it looks feasible.

---
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24
-#define JOBCTL_TRACED_BIT	25
+#define JOBCTL_STOPPED_BIT	25
+#define JOBCTL_TRACED_BIT	26
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,26 +193,32 @@ static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (READ_ONCE(task->__state) == TASK_TRACED &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(!task_is_traced(task));
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
@@ -222,12 +228,11 @@ static void ptrace_unfreeze_traced(struc
 	 * Recheck state under the lock to close this race.
 	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task)) {
-			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+//	WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
+	if (__fatal_signal_pending(task)) {
+		task->jobctl &= ~JOBCTL_TRACED;
+		wake_up_state(task, TASK_WAKEKILL);
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +256,46 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	WARN_ON_ONCE(READ_ONCE(child->__state) == __TASK_TRACED);
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	/* wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop() */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1340,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1482,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
  out_put_task_struct:
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACED_QUIESCE bridges the gap between
+		 * set_current_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touched ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
+		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-21 10:26                                   ` Peter Zijlstra
@ 2022-04-21 10:49                                     ` Oleg Nesterov
  2022-04-21 11:50                                       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Oleg Nesterov @ 2022-04-21 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

On 04/21, Peter Zijlstra wrote:
>
> As such, I've taken the liberty of munging yours and Oleg's approach
> together. I've yet to actually test this but it looks feasible.

Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
and I can't apply it with or without 1/5. So I am not sure I understand
what exactly it does.

Oleg.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-21 10:49                                     ` Oleg Nesterov
@ 2022-04-21 11:50                                       ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2022-04-21 11:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

On Thu, Apr 21, 2022 at 12:49:43PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > As such, I've taken the liberty of munging yours and Oleg's approach
> > together. I've yet to actually test this but it looks feasible.
> 
> Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
> and I can't apply it with or without 1/5. So I am not sure I understand
> what exactly it does.

Yes, it is on top of a modified patch 1; I mean to post an updated and tested
series later today.

Basically it does the TRACED_QUIESCE thing from your patch, and then
does the DELAY_WAKEKILL thing from Eric's patch.
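
Condensed from the patch above (error paths and the fatal_signal_pending()
check in the wait loop elided), ptrace_check_attach() then ends up doing
roughly:

	/* 1) wait until the tracee has cleared JOBCTL_TRACED_QUIESCE, i.e. it
	 *    is past the blocking region between set_special_state(TASK_TRACED)
	 *    and schedule() */
	for (;;) {
		set_current_state(TASK_KILLABLE);
		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	/* 2) now wait_task_inactive() can't be fooled by TASK_TRACED sitting
	 *    in ->saved_state */
	if (!wait_task_inactive(child, TASK_TRACED))
		return -ESRCH;

	/* 3) only then freeze: set JOBCTL_DELAY_WAKEKILL under siglock so a
	 *    killable wakeup can't disturb the tracee */
	if (!ptrace_freeze_traced(child))
		return -ESRCH;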


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-21  7:21                                 ` Peter Zijlstra
  2022-04-21 10:26                                   ` Peter Zijlstra
@ 2022-04-21 14:45                                   ` Eric W. Biederman
  1 sibling, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2022-04-21 14:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
>> 
>> I was thinking about this and I have an approach from a different
>> direction.  In particular it removes the need for ptrace_freeze_attach
>> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
>> bit is used to suppress waking up a process with TASK_WAKEKILL.
>> 
>> I think this would be a good technique to completely decouple
>> PREEMPT_RT from the work that ptrace_freeze_attach does.
>> 
>> Comments?
>
> On first read-through, I like it! A few comments down below..
>
>> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>>  	 * Recheck state under the lock to close this race.
>>  	 */
>> -	spin_lock_irq(&task->sighand->siglock);
>> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
>> -		if (__fatal_signal_pending(task))
>> -			wake_up_state(task, __TASK_TRACED);
>> -		else
>> -			WRITE_ONCE(task->__state, TASK_TRACED);
>> -	}
>> +	spin_unlock_irq(&task->sighand->siglock);
>
>   ^^^^ this should be spin_lock_irq(...)

Doh!

Thank you for spotting that.  That solves my nasty splat in
__send_signal.


>
>> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
>> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>> +	if (fatal_signal_pending(task))
>> +		wake_up_state(task, TASK_WAKEKILL);
>>  	spin_unlock_irq(&task->sighand->siglock);
>>  }
>>  
>> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	 */
>>  	read_lock(&tasklist_lock);
>>  	if (child->ptrace && child->parent == current) {
>> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
>> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>>  		/*
>>  		 * child->sighand can't be NULL, release_task()
>>  		 * does ptrace_unlink() before __exit_signal().
>> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	read_unlock(&tasklist_lock);
>>  
>>  	if (!ret && !ignore_state) {
>> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
>> +		if (!wait_task_inactive(child, TASK_TRACED)) {
>
> This is still very dubious, there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to due to TASK_TRACED being temporarily
> held in ->saved_state.

When it comes to PREEMPT_RT yes.

I think we might be able to remove the wait_task_inactive() call; I am
not certain what it gets us.

All this change gets us is the removal of playing with __state.

Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC][PATCH] ptrace: Don't change __state
  2022-04-21  9:46                                 ` Oleg Nesterov
@ 2022-04-21 15:01                                   ` Eric W. Biederman
  0 siblings, 0 replies; 37+ messages in thread
From: Eric W. Biederman @ 2022-04-21 15:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, rjw, mingo, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/20, Eric W. Biederman wrote:
>> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>>  	 * status and clears the code too; this can't race with the tracee, it
>>  	 * takes siglock after resume.
>>  	 */
>> -	need_siglock = data && !thread_group_empty(current);
>>  	if (need_siglock)
>>  		spin_lock_irq(&child->sighand->siglock);
>
> Hmm?

That's a half backed-out change (I thought ptrace_resume() would need to
clear JOBCTL_DELAY_WAKEKILL); I somehow failed to restore the
need_siglock line.

Eric


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-04-21 15:01 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-12 11:44 [PATCH 0/5] ptrace-vs-PREEMPT_RT and freezer rewrite Peter Zijlstra
2022-04-12 11:44 ` [PATCH 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Peter Zijlstra
2022-04-13 13:29   ` Oleg Nesterov
2022-04-13 16:47     ` Peter Zijlstra
2022-04-12 11:44 ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
2022-04-13 13:24   ` Oleg Nesterov
2022-04-13 16:58     ` Peter Zijlstra
2022-04-13 18:57     ` Oleg Nesterov
2022-04-13 18:59       ` Oleg Nesterov
2022-04-13 19:20         ` Peter Zijlstra
2022-04-13 19:56           ` Peter Zijlstra
2022-04-14 11:54             ` Oleg Nesterov
2022-04-14 12:08               ` Oleg Nesterov
2022-04-14 18:34               ` Oleg Nesterov
2022-04-14 22:45                 ` Peter Zijlstra
2022-04-15 10:16                   ` Oleg Nesterov
2022-04-15 10:57                     ` Oleg Nesterov
2022-04-15 12:01                       ` Peter Zijlstra
2022-04-18 17:01                         ` Oleg Nesterov
2022-04-18 17:19                           ` Oleg Nesterov
2022-04-20 13:17                           ` Peter Zijlstra
2022-04-20 18:03                             ` Oleg Nesterov
2022-04-20 20:54                               ` [RFC][PATCH] ptrace: Don't change __state Eric W. Biederman
2022-04-21  7:21                                 ` Peter Zijlstra
2022-04-21 10:26                                   ` Peter Zijlstra
2022-04-21 10:49                                     ` Oleg Nesterov
2022-04-21 11:50                                       ` Peter Zijlstra
2022-04-21 14:45                                   ` Eric W. Biederman
2022-04-21  9:46                                 ` Oleg Nesterov
2022-04-21 15:01                                   ` Eric W. Biederman
2022-04-20 10:20                       ` [PATCH 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT Peter Zijlstra
2022-04-20 11:35                         ` Oleg Nesterov
2022-04-15 12:00                     ` Peter Zijlstra
2022-04-15 12:56                       ` Oleg Nesterov
2022-04-12 11:44 ` [PATCH 3/5] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
2022-04-12 11:44 ` [PATCH 4/5] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
2022-04-12 11:44 ` [PATCH 5/5] freezer,sched: Rewrite core freezer logic Peter Zijlstra
