* [PATCH v3 0/6] Freezer Rewrite
@ 2022-08-22 11:18 Peter Zijlstra
  2022-08-22 11:18 ` [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Hi all,

With Eric having picked up the ptrace patches adding JOBCTL_STOPPED /
JOBCTL_TRACED, and those now having landed in Linus' tree, here is a
respin of the Freezer rewrite that relies on them.

---
 drivers/acpi/x86/s2idle.c         |   12 +
 drivers/android/binder.c          |    4 
 drivers/media/pci/pt3/pt3.c       |    4 
 drivers/scsi/scsi_transport_spi.c |    7 -
 fs/cifs/inode.c                   |    4 
 fs/cifs/transport.c               |    5 
 fs/coredump.c                     |    5 
 fs/nfs/file.c                     |    3 
 fs/nfs/inode.c                    |   12 -
 fs/nfs/nfs3proc.c                 |    3 
 fs/nfs/nfs4proc.c                 |   14 +-
 fs/nfs/nfs4state.c                |    3 
 fs/nfs/pnfs.c                     |    4 
 fs/xfs/xfs_trans_ail.c            |    8 -
 include/linux/completion.h        |    1 
 include/linux/freezer.h           |  245 +-------------------------------------
 include/linux/sched.h             |   41 +++---
 include/linux/sunrpc/sched.h      |    7 -
 include/linux/suspend.h           |    8 -
 include/linux/umh.h               |    9 -
 include/linux/wait.h              |   40 +++++-
 init/do_mounts_initrd.c           |   10 -
 kernel/cgroup/legacy_freezer.c    |   23 +--
 kernel/exit.c                     |    4 
 kernel/fork.c                     |    5 
 kernel/freezer.c                  |  133 ++++++++++++++------
 kernel/futex/waitwake.c           |    8 -
 kernel/hung_task.c                |    4 
 kernel/power/hibernate.c          |   35 +++--
 kernel/power/main.c               |   18 +-
 kernel/power/process.c            |   10 -
 kernel/power/suspend.c            |   12 +
 kernel/power/user.c               |   24 ++-
 kernel/ptrace.c                   |    2 
 kernel/sched/completion.c         |    9 +
 kernel/sched/core.c               |    6 
 kernel/signal.c                   |   14 +-
 kernel/time/hrtimer.c             |    4 
 kernel/umh.c                      |   18 +-
 mm/khugepaged.c                   |    4 
 net/sunrpc/sched.c                |   12 -
 net/unix/af_unix.c                |    8 -
 42 files changed, 341 insertions(+), 461 deletions(-)



* [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-08-23 17:25   ` Rafael J. Wysocki
  2022-08-22 11:18 ` [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Rafael explained that the reason for having both PF_NOFREEZE and
PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from
kthread context that has previously called set_freezable().

In preparation for merging the flags, have {,un}lock_system_sleep()
save and restore current->flags.
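
For illustration, a caller changes from the old void interface to the
save/restore form roughly like this (a sketch, not part of the diff
below; do_something_sleep_sensitive() is a placeholder):

	unsigned int sleep_flags;

	sleep_flags = lock_system_sleep();	/* saves current->flags */
	do_something_sleep_sensitive();		/* suspend/resume held off */
	unlock_system_sleep(sleep_flags);	/* restores PF_FREEZER_SKIP */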

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/acpi/x86/s2idle.c         |   12 ++++++++----
 drivers/scsi/scsi_transport_spi.c |    7 ++++---
 include/linux/suspend.h           |    8 ++++----
 kernel/power/hibernate.c          |   35 ++++++++++++++++++++++-------------
 kernel/power/main.c               |   16 ++++++++++------
 kernel/power/suspend.c            |   12 ++++++++----
 kernel/power/user.c               |   24 ++++++++++++++----------
 7 files changed, 70 insertions(+), 44 deletions(-)

--- a/drivers/acpi/x86/s2idle.c
+++ b/drivers/acpi/x86/s2idle.c
@@ -541,12 +541,14 @@ void acpi_s2idle_setup(void)
 
 int acpi_register_lps0_dev(struct acpi_s2idle_dev_ops *arg)
 {
+	unsigned int sleep_flags;
+
 	if (!lps0_device_handle || sleep_no_lps0)
 		return -ENODEV;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	list_add(&arg->list_node, &lps0_s2idle_devops_head);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return 0;
 }
@@ -554,12 +556,14 @@ EXPORT_SYMBOL_GPL(acpi_register_lps0_dev
 
 void acpi_unregister_lps0_dev(struct acpi_s2idle_dev_ops *arg)
 {
+	unsigned int sleep_flags;
+
 	if (!lps0_device_handle || sleep_no_lps0)
 		return;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	list_del(&arg->list_node);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(acpi_unregister_lps0_dev);
 
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -998,8 +998,9 @@ void
 spi_dv_device(struct scsi_device *sdev)
 {
 	struct scsi_target *starget = sdev->sdev_target;
-	u8 *buffer;
 	const int len = SPI_MAX_ECHO_BUFFER_SIZE*2;
+	unsigned int sleep_flags;
+	u8 *buffer;
 
 	/*
 	 * Because this function and the power management code both call
@@ -1007,7 +1008,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * while suspend or resume is in progress. Hence the
 	 * lock/unlock_system_sleep() calls.
 	 */
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (scsi_autopm_get_device(sdev))
 		goto unlock_system_sleep;
@@ -1058,7 +1059,7 @@ spi_dv_device(struct scsi_device *sdev)
 	scsi_autopm_put_device(sdev);
 
 unlock_system_sleep:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL(spi_dv_device);
 
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -510,8 +510,8 @@ extern bool pm_save_wakeup_count(unsigne
 extern void pm_wakep_autosleep_enabled(bool set);
 extern void pm_print_active_wakeup_sources(void);
 
-extern void lock_system_sleep(void);
-extern void unlock_system_sleep(void);
+extern unsigned int lock_system_sleep(void);
+extern void unlock_system_sleep(unsigned int);
 
 #else /* !CONFIG_PM_SLEEP */
 
@@ -534,8 +534,8 @@ static inline void pm_system_wakeup(void
 static inline void pm_wakeup_clear(bool reset) {}
 static inline void pm_system_irq_wakeup(unsigned int irq_number) {}
 
-static inline void lock_system_sleep(void) {}
-static inline void unlock_system_sleep(void) {}
+static inline unsigned int lock_system_sleep(void) { return 0; }
+static inline void unlock_system_sleep(unsigned int flags) {}
 
 #endif /* !CONFIG_PM_SLEEP */
 
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -92,20 +92,24 @@ bool hibernation_available(void)
  */
 void hibernation_set_ops(const struct platform_hibernation_ops *ops)
 {
+	unsigned int sleep_flags;
+
 	if (ops && !(ops->begin && ops->end &&  ops->pre_snapshot
 	    && ops->prepare && ops->finish && ops->enter && ops->pre_restore
 	    && ops->restore_cleanup && ops->leave)) {
 		WARN_ON(1);
 		return;
 	}
-	lock_system_sleep();
+
+	sleep_flags = lock_system_sleep();
+
 	hibernation_ops = ops;
 	if (ops)
 		hibernation_mode = HIBERNATION_PLATFORM;
 	else if (hibernation_mode == HIBERNATION_PLATFORM)
 		hibernation_mode = HIBERNATION_SHUTDOWN;
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(hibernation_set_ops);
 
@@ -713,6 +717,7 @@ static int load_image_and_restore(void)
 int hibernate(void)
 {
 	bool snapshot_test = false;
+	unsigned int sleep_flags;
 	int error;
 
 	if (!hibernation_available()) {
@@ -720,7 +725,7 @@ int hibernate(void)
 		return -EPERM;
 	}
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	/* The snapshot device should not be opened while we're running */
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -794,7 +799,7 @@ int hibernate(void)
 	pm_restore_console();
 	hibernate_release();
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 	pr_info("hibernation exit\n");
 
 	return error;
@@ -809,9 +814,10 @@ int hibernate(void)
  */
 int hibernate_quiet_exec(int (*func)(void *data), void *data)
 {
+	unsigned int sleep_flags;
 	int error;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -891,7 +897,7 @@ int hibernate_quiet_exec(int (*func)(voi
 	hibernate_release();
 
 unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error;
 }
@@ -1100,11 +1106,12 @@ static ssize_t disk_show(struct kobject
 static ssize_t disk_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t n)
 {
+	int mode = HIBERNATION_INVALID;
+	unsigned int sleep_flags;
 	int error = 0;
-	int i;
 	int len;
 	char *p;
-	int mode = HIBERNATION_INVALID;
+	int i;
 
 	if (!hibernation_available())
 		return -EPERM;
@@ -1112,7 +1119,7 @@ static ssize_t disk_store(struct kobject
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
 		if (len == strlen(hibernation_modes[i])
 		    && !strncmp(buf, hibernation_modes[i], len)) {
@@ -1142,7 +1149,7 @@ static ssize_t disk_store(struct kobject
 	if (!error)
 		pm_pr_dbg("Hibernation mode set to '%s'\n",
 			       hibernation_modes[mode]);
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 	return error ? error : n;
 }
 
@@ -1158,9 +1165,10 @@ static ssize_t resume_show(struct kobjec
 static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr,
 			    const char *buf, size_t n)
 {
-	dev_t res;
+	unsigned int sleep_flags;
 	int len = n;
 	char *name;
+	dev_t res;
 
 	if (len && buf[len-1] == '\n')
 		len--;
@@ -1173,9 +1181,10 @@ static ssize_t resume_store(struct kobje
 	if (!res)
 		return -EINVAL;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 	swsusp_resume_device = res;
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
+
 	pm_pr_dbg("Configured hibernation resume from disk to %u\n",
 		  swsusp_resume_device);
 	noresume = 0;
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -21,14 +21,16 @@
 
 #ifdef CONFIG_PM_SLEEP
 
-void lock_system_sleep(void)
+unsigned int lock_system_sleep(void)
 {
+	unsigned int flags = current->flags;
 	current->flags |= PF_FREEZER_SKIP;
 	mutex_lock(&system_transition_mutex);
+	return flags;
 }
 EXPORT_SYMBOL_GPL(lock_system_sleep);
 
-void unlock_system_sleep(void)
+void unlock_system_sleep(unsigned int flags)
 {
 	/*
 	 * Don't use freezer_count() because we don't want the call to
@@ -46,7 +48,8 @@ void unlock_system_sleep(void)
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	current->flags &= ~PF_FREEZER_SKIP;
+	if (!(flags & PF_FREEZER_SKIP))
+		current->flags &= ~PF_FREEZER_SKIP;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
@@ -263,16 +266,17 @@ static ssize_t pm_test_show(struct kobje
 static ssize_t pm_test_store(struct kobject *kobj, struct kobj_attribute *attr,
 				const char *buf, size_t n)
 {
+	unsigned int sleep_flags;
 	const char * const *s;
+	int error = -EINVAL;
 	int level;
 	char *p;
 	int len;
-	int error = -EINVAL;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	level = TEST_FIRST;
 	for (s = &pm_tests[level]; level <= TEST_MAX; s++, level++)
@@ -282,7 +286,7 @@ static ssize_t pm_test_store(struct kobj
 			break;
 		}
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error ? error : n;
 }
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -75,9 +75,11 @@ EXPORT_SYMBOL_GPL(pm_suspend_default_s2i
 
 void s2idle_set_ops(const struct platform_s2idle_ops *ops)
 {
-	lock_system_sleep();
+	unsigned int sleep_flags;
+
+	sleep_flags = lock_system_sleep();
 	s2idle_ops = ops;
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 
 static void s2idle_begin(void)
@@ -200,7 +202,9 @@ __setup("mem_sleep_default=", mem_sleep_
  */
 void suspend_set_ops(const struct platform_suspend_ops *ops)
 {
-	lock_system_sleep();
+	unsigned int sleep_flags;
+
+	sleep_flags = lock_system_sleep();
 
 	suspend_ops = ops;
 
@@ -216,7 +220,7 @@ void suspend_set_ops(const struct platfo
 			mem_sleep_current = PM_SUSPEND_MEM;
 	}
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 }
 EXPORT_SYMBOL_GPL(suspend_set_ops);
 
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -47,12 +47,13 @@ int is_hibernate_resume_dev(dev_t dev)
 static int snapshot_open(struct inode *inode, struct file *filp)
 {
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	int error;
 
 	if (!hibernation_available())
 		return -EPERM;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	if (!hibernate_acquire()) {
 		error = -EBUSY;
@@ -98,7 +99,7 @@ static int snapshot_open(struct inode *i
 	data->dev = 0;
 
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return error;
 }
@@ -106,8 +107,9 @@ static int snapshot_open(struct inode *i
 static int snapshot_release(struct inode *inode, struct file *filp)
 {
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	swsusp_free();
 	data = filp->private_data;
@@ -124,7 +126,7 @@ static int snapshot_release(struct inode
 			PM_POST_HIBERNATION : PM_POST_RESTORE);
 	hibernate_release();
 
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return 0;
 }
@@ -132,11 +134,12 @@ static int snapshot_release(struct inode
 static ssize_t snapshot_read(struct file *filp, char __user *buf,
                              size_t count, loff_t *offp)
 {
+	loff_t pg_offp = *offp & ~PAGE_MASK;
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	ssize_t res;
-	loff_t pg_offp = *offp & ~PAGE_MASK;
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	data = filp->private_data;
 	if (!data->ready) {
@@ -157,7 +160,7 @@ static ssize_t snapshot_read(struct file
 		*offp += res;
 
  Unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return res;
 }
@@ -165,16 +168,17 @@ static ssize_t snapshot_read(struct file
 static ssize_t snapshot_write(struct file *filp, const char __user *buf,
                               size_t count, loff_t *offp)
 {
+	loff_t pg_offp = *offp & ~PAGE_MASK;
 	struct snapshot_data *data;
+	unsigned int sleep_flags;
 	ssize_t res;
-	loff_t pg_offp = *offp & ~PAGE_MASK;
 
 	if (need_wait) {
 		wait_for_device_probe();
 		need_wait = false;
 	}
 
-	lock_system_sleep();
+	sleep_flags = lock_system_sleep();
 
 	data = filp->private_data;
 
@@ -196,7 +200,7 @@ static ssize_t snapshot_write(struct fil
 	if (res > 0)
 		*offp += res;
 unlock:
-	unlock_system_sleep();
+	unlock_system_sleep(sleep_flags);
 
 	return res;
 }




* [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
  2022-08-22 11:18 ` [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-08-23 17:28   ` Rafael J. Wysocki
  2022-08-22 11:18 ` [PATCH v3 3/6] sched: Change wait_task_inactive()'s match_state Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure
that the UMH, which is going to freeze the system, doesn't wait
indefinitely for its caller.

Rework things by adding UMH_FREEZABLE to indicate the completion is
freezable.
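
The converted handle_initrd() call then reads (a sketch of the
do_mounts_initrd.c hunk below):

	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
					 GFP_KERNEL, init_linuxrc, NULL, NULL);
	if (info)
		call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);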

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/umh.h     |    9 +++++----
 init/do_mounts_initrd.c |   10 +---------
 kernel/umh.c            |    8 ++++++++
 3 files changed, 14 insertions(+), 13 deletions(-)

--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -11,10 +11,11 @@
 struct cred;
 struct file;
 
-#define UMH_NO_WAIT	0	/* don't wait at all */
-#define UMH_WAIT_EXEC	1	/* wait for the exec, but not the process */
-#define UMH_WAIT_PROC	2	/* wait for the process to complete */
-#define UMH_KILLABLE	4	/* wait for EXEC/PROC killable */
+#define UMH_NO_WAIT	0x00	/* don't wait at all */
+#define UMH_WAIT_EXEC	0x01	/* wait for the exec, but not the process */
+#define UMH_WAIT_PROC	0x02	/* wait for the process to complete */
+#define UMH_KILLABLE	0x04	/* wait for EXEC/PROC killable */
+#define UMH_FREEZABLE	0x08	/* wait for EXEC/PROC freezable */
 
 struct subprocess_info {
 	struct work_struct work;
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -79,19 +79,11 @@ static void __init handle_initrd(void)
 	init_mkdir("/old", 0700);
 	init_chdir("/old");
 
-	/*
-	 * In case that a resume from disk is carried out by linuxrc or one of
-	 * its children, we need to tell the freezer not to wait for us.
-	 */
-	current->flags |= PF_FREEZER_SKIP;
-
 	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
 					 GFP_KERNEL, init_linuxrc, NULL, NULL);
 	if (!info)
 		return;
-	call_usermodehelper_exec(info, UMH_WAIT_PROC);
-
-	current->flags &= ~PF_FREEZER_SKIP;
+	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
 
 	/* move initrd to rootfs' /old */
 	init_mount("..", ".", NULL, MS_MOVE, NULL);
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -28,6 +28,7 @@
 #include <linux/async.h>
 #include <linux/uaccess.h>
 #include <linux/initrd.h>
+#include <linux/freezer.h>
 
 #include <trace/events/module.h>
 
@@ -436,6 +437,9 @@ int call_usermodehelper_exec(struct subp
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 
+	if (wait & UMH_FREEZABLE)
+		freezer_do_not_count();
+
 	if (wait & UMH_KILLABLE) {
 		retval = wait_for_completion_killable(&done);
 		if (!retval)
@@ -448,6 +452,10 @@ int call_usermodehelper_exec(struct subp
 	}
 
 	wait_for_completion(&done);
+
+	if (wait & UMH_FREEZABLE)
+		freezer_count();
+
 wait_done:
 	retval = sub_info->retval;
 out:




* [PATCH v3 3/6] sched: Change wait_task_inactive()'s match_state
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
  2022-08-22 11:18 ` [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
  2022-08-22 11:18 ` [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-09-04 10:44   ` Ingo Molnar
  2022-08-22 11:18 ` [PATCH v3 4/6] sched/completion: Add wait_for_completion_state() Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Make wait_task_inactive()'s @match_state work like ttwu()'s @state.

That is, instead of an equal comparison, use it as a mask. This allows
matching multiple block conditions.
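
With @match_state acting as a mask, a caller can wait for a task that
blocked in either of two states with a single call, e.g. (a sketch,
assuming a struct task_struct *p):

	/* nonzero ncsw once @p is off-CPU in __TASK_STOPPED or __TASK_TRACED */
	ncsw = wait_task_inactive(p, __TASK_STOPPED | __TASK_TRACED);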

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_running(rq, p)) {
-			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
+			if (match_state && !(READ_ONCE(p->__state) & match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
 		running = task_running(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || READ_ONCE(p->__state) == match_state)
+		if (!match_state || (READ_ONCE(p->__state) & match_state))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
 




* [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
                   ` (2 preceding siblings ...)
  2022-08-22 11:18 ` [PATCH v3 3/6] sched: Change wait_task_inactive()'s match_state Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-08-23 17:32   ` Rafael J. Wysocki
  2022-09-04 10:46   ` Ingo Molnar
  2022-08-22 11:18 ` [PATCH v3 5/6] sched/wait: Add wait_event_state() Peter Zijlstra
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  5 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Allows waiting with a custom @state.
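
For example, later patches use this to fold freezer_do_not_count()
pairs into the wait itself; the pattern is (sketch):

	/* killable and freezable; returns -ERESTARTSYS if killed, else 0 */
	killed = wait_for_completion_state(&done,
					   TASK_KILLABLE|TASK_FREEZABLE);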

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/completion.h |    1 +
 kernel/sched/completion.c  |    9 +++++++++
 2 files changed, 10 insertions(+)

--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
 extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
+extern int wait_for_completion_state(struct completion *x, unsigned int state);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
 						   unsigned long timeout);
 extern unsigned long wait_for_completion_io_timeout(struct completion *x,
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
 }
 EXPORT_SYMBOL(wait_for_completion_killable);
 
+int __sched wait_for_completion_state(struct completion *x, unsigned int state)
+{
+	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
+	if (t == -ERESTARTSYS)
+		return t;
+	return 0;
+}
+EXPORT_SYMBOL(wait_for_completion_state);
+
 /**
  * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
  * @x:  holds the state of this particular completion




* [PATCH v3 5/6] sched/wait: Add wait_event_state()
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
                   ` (3 preceding siblings ...)
  2022-08-22 11:18 ` [PATCH v3 4/6] sched/completion: Add wait_for_completion_state() Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-09-04  9:54   ` Ingo Molnar
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  5 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Allows waiting with a custom @state.
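
For example, the last patch in this series uses it in
fs/cifs/transport.c as (sketch):

	error = wait_event_state(server->response_q,
				 midQ->mid_state != MID_REQUEST_SUBMITTED,
				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));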

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/wait.h |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -931,6 +931,34 @@ extern int do_wait_intr_irq(wait_queue_h
 	__ret;									\
 })
 
+#define __wait_event_state(wq, condition, state)				\
+	___wait_event(wq, condition, state, 0, 0, schedule())
+
+/**
+ * wait_event_state - sleep until a condition gets true
+ * @wq_head: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @state: state to sleep in
+ *
+ * The process is put to sleep (@state) until the @condition evaluates to true
+ * or a signal is received.  The @condition is checked each time the waitqueue
+ * @wq_head is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function will return -ERESTARTSYS if it was interrupted by a
+ * signal and 0 if @condition evaluated to true.
+ */
+#define wait_event_state(wq_head, condition, state)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_state(wq_head, condition, state);		\
+	__ret;									\
+})
+
 #define __wait_event_killable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
 		      TASK_KILLABLE, 0, timeout,				\




* [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
                   ` (4 preceding siblings ...)
  2022-08-22 11:18 ` [PATCH v3 5/6] sched/wait: Add wait_event_state() Peter Zijlstra
@ 2022-08-22 11:18 ` Peter Zijlstra
  2022-08-23 17:36   ` Rafael J. Wysocki
                     ` (3 more replies)
  5 siblings, 4 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-22 11:18 UTC
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

Rewrite the core freezer to behave better wrt thawing and be simpler
in general.

By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured that frozen tasks stay frozen until thawed and don't randomly
wake up early, as is currently possible.

As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).

Specifically; the current scheme works a little like:

	freezer_do_not_count();
	schedule();
	freezer_count();

And either the task is blocked, or it lands in try_to_freeze()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.

However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.

That is, thawing tries to thaw things in an explicit order: kernel
threads and workqueues first, then SMP is brought back, then userspace,
etc. However, due to the above-mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.

This can be a fatal problem in asymmetric ISA architectures (e.g. ARMv9)
where the userspace task requires a special CPU to run.

As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:

	TASK_FREEZABLE	-> TASK_FROZEN
	__TASK_STOPPED	-> TASK_FROZEN
	__TASK_TRACED	-> TASK_FROZEN

The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).

The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.

With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
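
Under the new scheme a freezable wait is expressed purely through the
task state; the common conversion pattern used throughout this patch
is (sketch):

	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
	if (!condition)
		schedule();
	__set_current_state(TASK_RUNNING);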

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/android/binder.c       |    4 
 drivers/media/pci/pt3/pt3.c    |    4 
 fs/cifs/inode.c                |    4 
 fs/cifs/transport.c            |    5 
 fs/coredump.c                  |    5 
 fs/nfs/file.c                  |    3 
 fs/nfs/inode.c                 |   12 --
 fs/nfs/nfs3proc.c              |    3 
 fs/nfs/nfs4proc.c              |   14 +-
 fs/nfs/nfs4state.c             |    3 
 fs/nfs/pnfs.c                  |    4 
 fs/xfs/xfs_trans_ail.c         |    8 -
 include/linux/freezer.h        |  245 +----------------------------------------
 include/linux/sched.h          |   41 +++---
 include/linux/sunrpc/sched.h   |    7 -
 include/linux/wait.h           |   12 +-
 kernel/cgroup/legacy_freezer.c |   23 +--
 kernel/exit.c                  |    4 
 kernel/fork.c                  |    5 
 kernel/freezer.c               |  133 ++++++++++++++++------
 kernel/futex/waitwake.c        |    8 -
 kernel/hung_task.c             |    4 
 kernel/power/main.c            |    6 -
 kernel/power/process.c         |   10 -
 kernel/ptrace.c                |    2 
 kernel/sched/core.c            |    2 
 kernel/signal.c                |   14 +-
 kernel/time/hrtimer.c          |    4 
 kernel/umh.c                   |   20 +--
 mm/khugepaged.c                |    4 
 net/sunrpc/sched.c             |   12 --
 net/unix/af_unix.c             |    8 -
 32 files changed, 224 insertions(+), 409 deletions(-)

--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -4247,10 +4247,9 @@ static int binder_wait_for_work(struct b
 	struct binder_proc *proc = thread->proc;
 	int ret = 0;
 
-	freezer_do_not_count();
 	binder_inner_proc_lock(proc);
 	for (;;) {
-		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		if (binder_has_work_ilocked(thread, do_proc_work))
 			break;
 		if (do_proc_work)
@@ -4267,7 +4266,6 @@ static int binder_wait_for_work(struct b
 	}
 	finish_wait(&thread->wait, &wait);
 	binder_inner_proc_unlock(proc);
-	freezer_count();
 
 	return ret;
 }
--- a/drivers/media/pci/pt3/pt3.c
+++ b/drivers/media/pci/pt3/pt3.c
@@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
 		pt3_proc_dma(adap);
 
 		delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		freezable_schedule_hrtimeout_range(&delay,
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_hrtimeout_range(&delay,
 					PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
 					HRTIMER_MODE_REL);
 	}
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2327,7 +2327,7 @@ cifs_invalidate_mapping(struct inode *in
 static int
 cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -2345,7 +2345,7 @@ cifs_revalidate_mapping(struct inode *in
 		return 0;
 
 	rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
-				     TASK_KILLABLE);
+				     TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (rc)
 		return rc;
 
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -757,8 +757,9 @@ wait_for_response(struct TCP_Server_Info
 {
 	int error;
 
-	error = wait_event_freezekillable_unsafe(server->response_q,
-				    midQ->mid_state != MID_REQUEST_SUBMITTED);
+	error = wait_event_state(server->response_q,
+				 midQ->mid_state != MID_REQUEST_SUBMITTED,
+				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
 	if (error < 0)
 		return -ERESTARTSYS;
 
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -402,9 +402,8 @@ static int coredump_wait(int exit_code,
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
-		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
-		freezer_count();
+		wait_for_completion_state(&core_state->startup,
+					  TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 		/*
 		 * Wait for all the threads to become inactive, so that
 		 * all the thread context (extended register state, like
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -570,7 +570,8 @@ static vm_fault_t nfs_vm_page_mkwrite(st
 	}
 
 	wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
+			   nfs_wait_bit_killable,
+			   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 
 	lock_page(page);
 	mapping = page_file_mapping(page);
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat
 	return nfs_fileid_to_ino_t(fattr->fileid);
 }
 
-static int nfs_wait_killable(int mode)
+int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
 }
-
-int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
-{
-	return nfs_wait_killable(mode);
-}
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
 /**
@@ -1331,7 +1326,8 @@ int nfs_clear_invalid_mapping(struct add
 	 */
 	for (;;) {
 		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
-					 nfs_wait_bit_killable, TASK_KILLABLE);
+					 nfs_wait_bit_killable,
+					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (ret)
 			goto out;
 		spin_lock(&inode->i_lock);
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt,
 		res = rpc_call_sync(clnt, msg, flags);
 		if (res != -EJUKEBOX)
 			break;
-		freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+		__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+		schedule_timeout(NFS_JUKEBOX_RETRY_TIME);
 		res = -ERESTARTSYS;
 	} while (!fatal_signal_pending(current));
 	return res;
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -416,8 +416,8 @@ static int nfs4_delay_killable(long *tim
 {
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
+	__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!__fatal_signal_pending(current))
 		return 0;
 	return -EINTR;
@@ -427,7 +427,8 @@ static int nfs4_delay_interruptible(long
 {
 	might_sleep();
 
-	freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!signal_pending(current))
 		return 0;
 	return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
@@ -7406,7 +7407,8 @@ nfs4_retry_setlk_simple(struct nfs4_stat
 		status = nfs4_proc_setlk(state, cmd, request);
 		if ((status != -EAGAIN) || IS_SETLK(cmd))
 			break;
-		freezable_schedule_timeout_interruptible(timeout);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_timeout(timeout);
 		timeout *= 2;
 		timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
 		status = -ERESTARTSYS;
@@ -7474,10 +7476,8 @@ nfs4_retry_setlk(struct nfs4_state *stat
 			break;
 
 		status = -ERESTARTSYS;
-		freezer_do_not_count();
-		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
+		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 			   NFS4_LOCK_MAXTIMEOUT);
-		freezer_count();
 	} while (!signalled());
 
 	remove_wait_queue(q, &waiter.wait);
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1314,7 +1314,8 @@ int nfs4_wait_clnt_recover(struct nfs_cl
 
 	refcount_inc(&clp->cl_count);
 	res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
-				 nfs_wait_bit_killable, TASK_KILLABLE);
+				 nfs_wait_bit_killable,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (res)
 		goto out;
 	if (clp->cl_cons_state < 0)
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1908,7 +1908,7 @@ static int pnfs_prepare_to_retry_layoutg
 	pnfs_layoutcommit_inode(lo->plh_inode, false);
 	return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
 				   nfs_wait_bit_killable,
-				   TASK_KILLABLE);
+				   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
 
 static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
@@ -3193,7 +3193,7 @@ pnfs_layoutcommit_inode(struct inode *in
 		status = wait_on_bit_lock_action(&nfsi->flags,
 				NFS_INO_LAYOUTCOMMITTING,
 				nfs_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (status)
 			goto out;
 	}
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -602,9 +602,9 @@ xfsaild(
 
 	while (1) {
 		if (tout && tout <= 20)
-			set_current_state(TASK_KILLABLE);
+			set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
 		else
-			set_current_state(TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 		/*
 		 * Check kthread_should_stop() after we set the task state to
@@ -653,14 +653,14 @@ xfsaild(
 		    ailp->ail_target == ailp->ail_target_prev &&
 		    list_empty(&ailp->ail_buf_list)) {
 			spin_unlock(&ailp->ail_lock);
-			freezable_schedule();
+			schedule();
 			tout = 0;
 			continue;
 		}
 		spin_unlock(&ailp->ail_lock);
 
 		if (tout)
-			freezable_schedule_timeout(msecs_to_jiffies(tout));
+			schedule_timeout(msecs_to_jiffies(tout));
 
 		__set_current_state(TASK_RUNNING);
 
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs
 /*
  * Check if a process has been frozen
  */
-static inline bool frozen(struct task_struct *p)
-{
-	return p->flags & PF_FROZEN;
-}
+extern bool frozen(struct task_struct *p);
 
 extern bool freezing_slow_path(struct task_struct *p);
 
@@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct ta
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
 extern void thaw_kernel_threads(void);
 
-/*
- * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
- * If try_to_freeze causes a lockdep warning it means the caller may deadlock
- */
-static inline bool try_to_freeze_unsafe(void)
+static inline bool try_to_freeze(void)
 {
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
-	return __refrigerator(false);
-}
-
-static inline bool try_to_freeze(void)
-{
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
-	return try_to_freeze_unsafe();
+	return __refrigerator(false);
 }
 
 extern bool freeze_task(struct task_struct *p);
@@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struc
 }
 #endif /* !CONFIG_CGROUP_FREEZER */
 
-/*
- * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
- * calls wait_for_completion(&vfork) and reset right after it returns from this
- * function.  Next, the parent should call try_to_freeze() to freeze itself
- * appropriately in case the child has exited before the freezing of tasks is
- * complete.  However, we don't want kernel threads to be frozen in unexpected
- * places, so we allow them to block freeze_processes() instead or to set
- * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
- * parent won't really block freeze_processes(), since ____call_usermodehelper()
- * (the child) does a little before exec/exit and it can't be frozen before
- * waking up the parent.
- */
-
-
-/**
- * freezer_do_not_count - tell freezer to ignore %current
- *
- * Tell freezers to ignore the current task when determining whether the
- * target frozen state is reached.  IOW, the current task will be
- * considered frozen enough by freezers.
- *
- * The caller shouldn't do anything which isn't allowed for a frozen task
- * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
- * wrap a scheduling operation and nothing much else.
- */
-static inline void freezer_do_not_count(void)
-{
-	current->flags |= PF_FREEZER_SKIP;
-}
-
-/**
- * freezer_count - tell freezer to stop ignoring %current
- *
- * Undo freezer_do_not_count().  It tells freezers that %current should be
- * considered again and tries to freeze if freezing condition is already in
- * effect.
- */
-static inline void freezer_count(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezer_count_unsafe(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	smp_mb();
-	try_to_freeze_unsafe();
-}
-
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
-/*
- * These functions are intended to be used whenever you want allow a sleeping
- * task to be frozen. Note that neither return any clear indication of
- * whether a freeze event happened while in this function.
- */
-
-/* Like schedule(), but should not block the freezer. */
-static inline void freezable_schedule(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezable_schedule_unsafe(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count_unsafe();
-}
-
-/*
- * Like schedule_timeout(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout_interruptible(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
-{
-	long __retval;
-
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/* Like schedule_timeout_killable(), but should not block the freezer. */
-static inline long freezable_schedule_timeout_killable(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/*
- * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
-		u64 delta, const enum hrtimer_mode mode)
-{
-	int __retval;
-	freezer_do_not_count();
-	__retval = schedule_hrtimeout_range(expires, delta, mode);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Freezer-friendly wrappers around wait_event_interruptible(),
- * wait_event_killable() and wait_event_interruptible_timeout(), originally
- * defined in <linux/wait.h>
- */
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-({									\
-	int __retval;							\
-	freezer_do_not_count();						\
-	__retval = wait_event_killable(wq, (condition));		\
-	freezer_count_unsafe();						\
-	__retval;							\
-})
-
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
@@ -281,35 +83,8 @@ static inline void thaw_kernel_threads(v
 
 static inline bool try_to_freeze(void) { return false; }
 
-static inline void freezer_do_not_count(void) {}
-static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
-#define freezable_schedule()  schedule()
-
-#define freezable_schedule_unsafe()  schedule()
-
-#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
-
-#define freezable_schedule_timeout_interruptible(timeout)		\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_interruptible_unsafe(timeout)	\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_killable(timeout)			\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_timeout_killable_unsafe(timeout)		\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_hrtimeout_range(expires, delta, mode)	\
-	schedule_hrtimeout_range(expires, delta, mode)
-
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-		wait_event_killable(wq, condition)
-
 #endif /* !CONFIG_FREEZER */
 
 #endif	/* FREEZER_H_INCLUDED */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -81,25 +81,32 @@ struct task_group;
  */
 
 /* Used in tsk->state: */
-#define TASK_RUNNING			0x0000
-#define TASK_INTERRUPTIBLE		0x0001
-#define TASK_UNINTERRUPTIBLE		0x0002
-#define __TASK_STOPPED			0x0004
-#define __TASK_TRACED			0x0008
+#define TASK_RUNNING			0x000000
+#define TASK_INTERRUPTIBLE		0x000001
+#define TASK_UNINTERRUPTIBLE		0x000002
+#define __TASK_STOPPED			0x000004
+#define __TASK_TRACED			0x000008
 /* Used in tsk->exit_state: */
-#define EXIT_DEAD			0x0010
-#define EXIT_ZOMBIE			0x0020
+#define EXIT_DEAD			0x000010
+#define EXIT_ZOMBIE			0x000020
 #define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
 /* Used in tsk->state again: */
-#define TASK_PARKED			0x0040
-#define TASK_DEAD			0x0080
-#define TASK_WAKEKILL			0x0100
-#define TASK_WAKING			0x0200
-#define TASK_NOLOAD			0x0400
-#define TASK_NEW			0x0800
-/* RT specific auxilliary flag to mark RT lock waiters */
-#define TASK_RTLOCK_WAIT		0x1000
-#define TASK_STATE_MAX			0x2000
+#define TASK_PARKED			0x000040
+#define TASK_DEAD			0x000080
+#define TASK_WAKEKILL			0x000100
+#define TASK_WAKING			0x000200
+#define TASK_NOLOAD			0x000400
+#define TASK_NEW			0x000800
+#define TASK_FREEZABLE			0x001000
+#define __TASK_FREEZABLE_UNSAFE	       (0x002000 * IS_ENABLED(CONFIG_LOCKDEP))
+#define TASK_FROZEN			0x004000
+#define TASK_RTLOCK_WAIT		0x008000
+#define TASK_STATE_MAX			0x010000
+
+/*
+ * DO NOT ADD ANY NEW USERS !
+ */
+#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1714,7 +1721,6 @@ extern struct pid *cad_pid;
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1725,7 +1731,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -252,7 +252,7 @@ int		rpc_malloc(struct rpc_task *);
 void		rpc_free(struct rpc_task *);
 int		rpciod_up(void);
 void		rpciod_down(void);
-int		__rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int		rpc_wait_for_completion_task(struct rpc_task *task);
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 struct net;
 void		rpc_show_tasks(struct net *);
@@ -264,11 +264,6 @@ extern struct workqueue_struct *xprtiod_
 void		rpc_prepare_task(struct rpc_task *task);
 gfp_t		rpc_task_gfp_mask(void);
 
-static inline int rpc_wait_for_completion_task(struct rpc_task *task)
-{
-	return __rpc_wait_for_completion_task(task, NULL);
-}
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
 static inline const char * rpc_qname(const struct rpc_wait_queue *q)
 {
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -361,8 +361,8 @@ do {										\
 } while (0)
 
 #define __wait_event_freezable(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-			    freezable_schedule())
+	___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),	\
+			0, 0, schedule())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
@@ -420,8 +420,8 @@ do {										\
 
 #define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
-		      TASK_INTERRUPTIBLE, 0, timeout,				\
-		      __ret = freezable_schedule_timeout(__ret))
+		      (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,		\
+		      __ret = schedule_timeout(__ret))
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
@@ -642,8 +642,8 @@ do {										\
 
 
 #define __wait_event_freezable_exclusive(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,			\
-			freezable_schedule())
+	___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
+			schedule())
 
 #define wait_event_freezable_exclusive(wq, condition)				\
 ({										\
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -113,7 +113,7 @@ static int freezer_css_online(struct cgr
 
 	if (parent && (parent->state & CGROUP_FREEZING)) {
 		freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 	}
 
 	mutex_unlock(&freezer_mutex);
@@ -134,7 +134,7 @@ static void freezer_css_offline(struct c
 	mutex_lock(&freezer_mutex);
 
 	if (freezer->state & CGROUP_FREEZING)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 
 	freezer->state = 0;
 
@@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup
 			__thaw_task(task);
 		} else {
 			freeze_task(task);
+
 			/* clear FROZEN and propagate upwards */
 			while (freezer && (freezer->state & CGROUP_FROZEN)) {
 				freezer->state &= ~CGROUP_FROZEN;
@@ -271,16 +272,8 @@ static void update_if_frozen(struct cgro
 	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
-		if (freezing(task)) {
-			/*
-			 * freezer_should_skip() indicates that the task
-			 * should be skipped when determining freezing
-			 * completion.  Consider it frozen in addition to
-			 * the usual frozen condition.
-			 */
-			if (!frozen(task) && !freezer_should_skip(task))
-				goto out_iter_end;
-		}
+		if (freezing(task) && !frozen(task))
+			goto out_iter_end;
 	}
 
 	freezer->state |= CGROUP_FROZEN;
@@ -357,7 +350,7 @@ static void freezer_apply_state(struct f
 
 	if (freeze) {
 		if (!(freezer->state & CGROUP_FREEZING))
-			atomic_inc(&system_freezing_cnt);
+			static_branch_inc(&freezer_active);
 		freezer->state |= state;
 		freeze_cgroup(freezer);
 	} else {
@@ -366,9 +359,9 @@ static void freezer_apply_state(struct f
 		freezer->state &= ~state;
 
 		if (!(freezer->state & CGROUP_FREEZING)) {
-			if (was_freezing)
-				atomic_dec(&system_freezing_cnt);
 			freezer->state &= ~CGROUP_FROZEN;
+			if (was_freezing)
+				static_branch_dec(&freezer_active);
 			unfreeze_cgroup(freezer);
 		}
 	}
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -374,10 +374,10 @@ static void coredump_task_exit(struct ta
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
-			freezable_schedule();
+			schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 	}
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1420,13 +1420,12 @@ static void complete_vfork_done(struct t
 static int wait_for_vfork_done(struct task_struct *child,
 				struct completion *vfork)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE;
 	int killed;
 
-	freezer_do_not_count();
 	cgroup_enter_frozen();
-	killed = wait_for_completion_killable(vfork);
+	killed = wait_for_completion_state(vfork, state);
 	cgroup_leave_frozen(false);
-	freezer_count();
 
 	if (killed) {
 		task_lock(child);
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,10 +13,11 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
-/* indicate whether PM freezing is in effect, protected by
+/*
+ * indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
  */
 bool pm_freezing;
@@ -29,7 +30,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -52,41 +53,40 @@ bool freezing_slow_path(struct task_stru
 }
 EXPORT_SYMBOL(freezing_slow_path);
 
+bool frozen(struct task_struct *p)
+{
+	return READ_ONCE(p->__state) & TASK_FROZEN;
+}
+
 /* Refrigerator is place where frozen processes are stored :-). */
 bool __refrigerator(bool check_kthr_stop)
 {
-	/* Hmm, should we be allowed to suspend when there are realtime
-	   processes around? */
+	unsigned int state = get_current_state();
 	bool was_frozen = false;
-	unsigned int save = get_current_state();
 
 	pr_debug("%s entered refrigerator\n", current->comm);
 
+	WARN_ON_ONCE(state && !(state & TASK_NORMAL));
+
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		bool freeze;
+
+		set_current_state(TASK_FROZEN);
 
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
+		if (!freeze)
 			break;
+
 		was_frozen = true;
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-	/*
-	 * Restore saved task state before returning.  The mb'd version
-	 * needs to be used; otherwise, it might silently break
-	 * synchronization which depends on ordered task state change.
-	 */
-	set_current_state(save);
-
 	return was_frozen;
 }
 EXPORT_SYMBOL(__refrigerator);
@@ -101,6 +101,44 @@ static void fake_signal_wake_up(struct t
 	}
 }
 
+static int __set_task_frozen(struct task_struct *p, void *arg)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if (p->on_rq)
+		return 0;
+
+	if (p != current && task_curr(p))
+		return 0;
+
+	if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED)))
+		return 0;
+
+	/*
+	 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they
+	 * can suffer spurious wakeups.
+	 */
+	if (state & TASK_FREEZABLE)
+		WARN_ON_ONCE(!(state & TASK_NORMAL));
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * It's dangerous to freeze with locks held; there be dragons there.
+	 */
+	if (!(state & __TASK_FREEZABLE_UNSAFE))
+		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+#endif
+
+	WRITE_ONCE(p->__state, TASK_FROZEN);
+	return TASK_FROZEN;
+}
+
+static bool __freeze_task(struct task_struct *p)
+{
+	/* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */
+	return task_call_func(p, __set_task_frozen, NULL);
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +154,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -137,19 +163,52 @@ bool freeze_task(struct task_struct *p)
 	if (!(p->flags & PF_KTHREAD))
 		fake_signal_wake_up(p);
 	else
-		wake_up_state(p, TASK_INTERRUPTIBLE);
+		wake_up_state(p, TASK_NORMAL);
 
 	spin_unlock_irqrestore(&freezer_lock, flags);
 	return true;
 }
 
+/*
+ * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical
+ * state in p->jobctl. If either of them got a wakeup that was missed because
+ * of TASK_FROZEN, then their canonical state reflects that and the below will
+ * refuse to restore the special state and instead issue the wakeup.
+ */
+static int __set_task_special(struct task_struct *p, void *arg)
+{
+	unsigned int state = 0;
+
+	if (p->jobctl & JOBCTL_TRACED)
+		state = TASK_TRACED;
+
+	else if (p->jobctl & JOBCTL_STOPPED)
+		state = TASK_STOPPED;
+
+	if (state)
+		WRITE_ONCE(p->__state, state);
+
+	return state;
+}
+
 void __thaw_task(struct task_struct *p)
 {
-	unsigned long flags;
+	unsigned long flags, flags2;
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
-		wake_up_process(p);
+	if (WARN_ON_ONCE(freezing(p)))
+		goto unlock;
+
+	if (lock_task_sighand(p, &flags2)) {
+		/* TASK_FROZEN -> TASK_{STOPPED,TRACED} */
+		bool ret = task_call_func(p, __set_task_special, NULL);
+		unlock_task_sighand(p, &flags2);
+		if (ret)
+			goto unlock;
+	}
+
+	wake_up_state(p, TASK_FROZEN);
+unlock:
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -334,7 +334,7 @@ void futex_wait_queue(struct futex_hash_
 	 * futex_queue() calls spin_unlock() upon completion, both serializing
 	 * access to the hash list and forcing another memory barrier.
 	 */
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	futex_queue(q, hb);
 
 	/* Arm the timer */
@@ -352,7 +352,7 @@ void futex_wait_queue(struct futex_hash_
 		 * is no timeout, or if it has yet to expire.
 		 */
 		if (!timeout || timeout->task)
-			freezable_schedule();
+			schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 }
@@ -430,7 +430,7 @@ static int futex_wait_multiple_setup(str
 			return ret;
 	}
 
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 	for (i = 0; i < count; i++) {
 		u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
@@ -504,7 +504,7 @@ static void futex_sleep_multiple(struct
 			return;
 	}
 
-	freezable_schedule();
+	schedule();
 }
 
 /**
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -95,8 +95,8 @@ static void check_hung_task(struct task_
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(READ_ONCE(t->__state) & (TASK_FREEZABLE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -24,7 +24,7 @@
 unsigned int lock_system_sleep(void)
 {
 	unsigned int flags = current->flags;
-	current->flags |= PF_FREEZER_SKIP;
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 	return flags;
 }
@@ -48,8 +48,8 @@ void unlock_system_sleep(unsigned int fl
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	if (!(flags & PF_FREEZER_SKIP))
-		current->flags &= ~PF_FREEZER_SKIP;
+	if (!(flags & PF_NOFREEZE))
+		current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -50,8 +50,7 @@ static int try_to_freeze_tasks(bool user
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -96,8 +95,7 @@ static int try_to_freeze_tasks(bool user
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -129,7 +127,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(0);
 	pr_info("Freezing user space processes ... ");
@@ -190,7 +188,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -269,7 +269,7 @@ static int ptrace_check_attach(struct ta
 	read_unlock(&tasklist_lock);
 
 	if (!ret && !ignore_state &&
-	    WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
+	    WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED|TASK_FROZEN)))
 		ret = -ESRCH;
 
 	return ret;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6429,7 +6429,7 @@ static void __sched notrace __schedule(u
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2304,7 +2304,7 @@ static int ptrace_stop(int exit_code, in
 	read_unlock(&tasklist_lock);
 	cgroup_enter_frozen();
 	preempt_enable_no_resched();
-	freezable_schedule();
+	schedule();
 	cgroup_leave_frozen(true);
 
 	/*
@@ -2473,7 +2473,7 @@ static bool do_signal_stop(int signr)
 
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
-		freezable_schedule();
+		schedule();
 		return true;
 	} else {
 		/*
@@ -2548,11 +2548,11 @@ static void do_freezer_trap(void)
 	 * immediately (if there is a non-fatal signal pending), and
 	 * put the task into sleep.
 	 */
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	clear_thread_flag(TIF_SIGPENDING);
 	spin_unlock_irq(&current->sighand->siglock);
 	cgroup_enter_frozen();
-	freezable_schedule();
+	schedule();
 }
 
 static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
@@ -3600,9 +3600,9 @@ static int do_sigtimedwait(const sigset_
 		recalc_sigpending();
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		__set_current_state(TASK_INTERRUPTIBLE);
-		ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
-							 HRTIMER_MODE_REL);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
+					       HRTIMER_MODE_REL);
 		spin_lock_irq(&tsk->sighand->siglock);
 		__set_task_blocked(tsk, &tsk->real_blocked);
 		sigemptyset(&tsk->real_blocked);
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2037,11 +2037,11 @@ static int __sched do_nanosleep(struct h
 	struct restart_block *restart;
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		hrtimer_sleeper_start_expires(t, mode);
 
 		if (likely(t->task))
-			freezable_schedule();
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -404,6 +404,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup)
  */
 int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE;
 	DECLARE_COMPLETION_ONSTACK(done);
 	int retval = 0;
 
@@ -437,25 +438,22 @@ int call_usermodehelper_exec(struct subp
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 
+	if (wait & UMH_KILLABLE)
+		state |= TASK_KILLABLE;
+
 	if (wait & UMH_FREEZABLE)
-		freezer_do_not_count();
+		state |= TASK_FREEZABLE;
 
-	if (wait & UMH_KILLABLE) {
-		retval = wait_for_completion_killable(&done);
-		if (!retval)
-			goto wait_done;
+	retval = wait_for_completion_state(&done, state);
+	if (!retval)
+		goto wait_done;
 
+	if (wait & UMH_KILLABLE) {
 		/* umh_complete() will see NULL and free sub_info */
 		if (xchg(&sub_info->complete, NULL))
 			goto unlock;
-		/* fallthrough, umh_complete() was already called */
 	}
 
-	wait_for_completion(&done);
-
-	if (wait & UMH_FREEZABLE)
-		freezer_count();
-
 wait_done:
 	retval = sub_info->retval;
 out:
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -730,8 +730,8 @@ static void khugepaged_alloc_sleep(void)
 	DEFINE_WAIT(wait);
 
 	add_wait_queue(&khugepaged_wait, &wait);
-	freezable_schedule_timeout_interruptible(
-		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -269,7 +269,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -333,14 +333,12 @@ static int rpc_complete_task(struct rpc_
  * to enforce taking of the wq->lock and hence avoid races with
  * rpc_complete_task().
  */
-int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
+int rpc_wait_for_completion_task(struct rpc_task *task)
 {
-	if (action == NULL)
-		action = rpc_wait_bit_killable;
 	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
-			action, TASK_KILLABLE);
+			rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
-EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
+EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
 
 /*
  * Make an RPC task runnable.
@@ -964,7 +962,7 @@ static void __rpc_execute(struct rpc_tas
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2543,13 +2543,14 @@ static long unix_stream_data_wait(struct
 				  struct sk_buff *last, unsigned int last_len,
 				  bool freezable)
 {
+	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, state);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2562,10 +2563,7 @@ static long unix_stream_data_wait(struct
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		if (freezable)
-			timeo = freezable_schedule_timeout(timeo);
-		else
-			timeo = schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
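
To summarise the conversion this diff applies throughout: sleeps that
previously used the freezable_schedule*() wrappers now tag their blocking
state with TASK_FREEZABLE, so the freezer can transition them to
TASK_FROZEN, and thawing wakes them with an explicit TASK_FROZEN wakeup.
A minimal before/after sketch (illustrative only, not part of the patch):

	/* old: freezer skips the sleeper; racy on thaw */
	freezer_do_not_count();
	schedule();
	freezer_count();

	/* new: the sleep state itself is marked freezable */
	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
	schedule();
	__set_current_state(TASK_RUNNING);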



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags
  2022-08-22 11:18 ` [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
@ 2022-08-23 17:25   ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2022-08-23 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Eric Biederman,
	Sebastian Andrzej Siewior, Will Deacon,
	Linux Kernel Mailing List, Tejun Heo, Linux PM

On Mon, Aug 22, 2022 at 1:48 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Rafael explained that the reason for having both PF_NOFREEZE and
> PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from
> kthread context that has previously called set_freezable().
>
> In preparation for merging the flags, have {,un}lock_system_sleep() save
> and restore current->flags.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
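
For reference, the calling convention after this patch, distilled from
the hunks below (a minimal sketch; error paths elided):

	unsigned int sleep_flags;

	sleep_flags = lock_system_sleep();	/* saves current->flags */
	/* ... system sleep transition work ... */
	unlock_system_sleep(sleep_flags);	/* restores the saved state */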

> ---
>  drivers/acpi/x86/s2idle.c         |   12 ++++++++----
>  drivers/scsi/scsi_transport_spi.c |    7 ++++---
>  include/linux/suspend.h           |    8 ++++----
>  kernel/power/hibernate.c          |   35 ++++++++++++++++++++++-------------
>  kernel/power/main.c               |   16 ++++++++++------
>  kernel/power/suspend.c            |   12 ++++++++----
>  kernel/power/user.c               |   24 ++++++++++++++----------
>  7 files changed, 70 insertions(+), 44 deletions(-)
>
> --- a/drivers/acpi/x86/s2idle.c
> +++ b/drivers/acpi/x86/s2idle.c
> @@ -541,12 +541,14 @@ void acpi_s2idle_setup(void)
>
>  int acpi_register_lps0_dev(struct acpi_s2idle_dev_ops *arg)
>  {
> +       unsigned int sleep_flags;
> +
>         if (!lps0_device_handle || sleep_no_lps0)
>                 return -ENODEV;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>         list_add(&arg->list_node, &lps0_s2idle_devops_head);
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return 0;
>  }
> @@ -554,12 +556,14 @@ EXPORT_SYMBOL_GPL(acpi_register_lps0_dev
>
>  void acpi_unregister_lps0_dev(struct acpi_s2idle_dev_ops *arg)
>  {
> +       unsigned int sleep_flags;
> +
>         if (!lps0_device_handle || sleep_no_lps0)
>                 return;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>         list_del(&arg->list_node);
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>  }
>  EXPORT_SYMBOL_GPL(acpi_unregister_lps0_dev);
>
> --- a/drivers/scsi/scsi_transport_spi.c
> +++ b/drivers/scsi/scsi_transport_spi.c
> @@ -998,8 +998,9 @@ void
>  spi_dv_device(struct scsi_device *sdev)
>  {
>         struct scsi_target *starget = sdev->sdev_target;
> -       u8 *buffer;
>         const int len = SPI_MAX_ECHO_BUFFER_SIZE*2;
> +       unsigned int sleep_flags;
> +       u8 *buffer;
>
>         /*
>          * Because this function and the power management code both call
> @@ -1007,7 +1008,7 @@ spi_dv_device(struct scsi_device *sdev)
>          * while suspend or resume is in progress. Hence the
>          * lock/unlock_system_sleep() calls.
>          */
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         if (scsi_autopm_get_device(sdev))
>                 goto unlock_system_sleep;
> @@ -1058,7 +1059,7 @@ spi_dv_device(struct scsi_device *sdev)
>         scsi_autopm_put_device(sdev);
>
>  unlock_system_sleep:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>  }
>  EXPORT_SYMBOL(spi_dv_device);
>
> --- a/include/linux/suspend.h
> +++ b/include/linux/suspend.h
> @@ -510,8 +510,8 @@ extern bool pm_save_wakeup_count(unsigne
>  extern void pm_wakep_autosleep_enabled(bool set);
>  extern void pm_print_active_wakeup_sources(void);
>
> -extern void lock_system_sleep(void);
> -extern void unlock_system_sleep(void);
> +extern unsigned int lock_system_sleep(void);
> +extern void unlock_system_sleep(unsigned int);
>
>  #else /* !CONFIG_PM_SLEEP */
>
> @@ -534,8 +534,8 @@ static inline void pm_system_wakeup(void
>  static inline void pm_wakeup_clear(bool reset) {}
>  static inline void pm_system_irq_wakeup(unsigned int irq_number) {}
>
> -static inline void lock_system_sleep(void) {}
> -static inline void unlock_system_sleep(void) {}
> +static inline unsigned int lock_system_sleep(void) { return 0; }
> +static inline void unlock_system_sleep(unsigned int flags) {}
>
>  #endif /* !CONFIG_PM_SLEEP */
>
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -92,20 +92,24 @@ bool hibernation_available(void)
>   */
>  void hibernation_set_ops(const struct platform_hibernation_ops *ops)
>  {
> +       unsigned int sleep_flags;
> +
>         if (ops && !(ops->begin && ops->end &&  ops->pre_snapshot
>             && ops->prepare && ops->finish && ops->enter && ops->pre_restore
>             && ops->restore_cleanup && ops->leave)) {
>                 WARN_ON(1);
>                 return;
>         }
> -       lock_system_sleep();
> +
> +       sleep_flags = lock_system_sleep();
> +
>         hibernation_ops = ops;
>         if (ops)
>                 hibernation_mode = HIBERNATION_PLATFORM;
>         else if (hibernation_mode == HIBERNATION_PLATFORM)
>                 hibernation_mode = HIBERNATION_SHUTDOWN;
>
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>  }
>  EXPORT_SYMBOL_GPL(hibernation_set_ops);
>
> @@ -713,6 +717,7 @@ static int load_image_and_restore(void)
>  int hibernate(void)
>  {
>         bool snapshot_test = false;
> +       unsigned int sleep_flags;
>         int error;
>
>         if (!hibernation_available()) {
> @@ -720,7 +725,7 @@ int hibernate(void)
>                 return -EPERM;
>         }
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>         /* The snapshot device should not be opened while we're running */
>         if (!hibernate_acquire()) {
>                 error = -EBUSY;
> @@ -794,7 +799,7 @@ int hibernate(void)
>         pm_restore_console();
>         hibernate_release();
>   Unlock:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>         pr_info("hibernation exit\n");
>
>         return error;
> @@ -809,9 +814,10 @@ int hibernate(void)
>   */
>  int hibernate_quiet_exec(int (*func)(void *data), void *data)
>  {
> +       unsigned int sleep_flags;
>         int error;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         if (!hibernate_acquire()) {
>                 error = -EBUSY;
> @@ -891,7 +897,7 @@ int hibernate_quiet_exec(int (*func)(voi
>         hibernate_release();
>
>  unlock:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return error;
>  }
> @@ -1100,11 +1106,12 @@ static ssize_t disk_show(struct kobject
>  static ssize_t disk_store(struct kobject *kobj, struct kobj_attribute *attr,
>                           const char *buf, size_t n)
>  {
> +       int mode = HIBERNATION_INVALID;
> +       unsigned int sleep_flags;
>         int error = 0;
> -       int i;
>         int len;
>         char *p;
> -       int mode = HIBERNATION_INVALID;
> +       int i;
>
>         if (!hibernation_available())
>                 return -EPERM;
> @@ -1112,7 +1119,7 @@ static ssize_t disk_store(struct kobject
>         p = memchr(buf, '\n', n);
>         len = p ? p - buf : n;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>         for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
>                 if (len == strlen(hibernation_modes[i])
>                     && !strncmp(buf, hibernation_modes[i], len)) {
> @@ -1142,7 +1149,7 @@ static ssize_t disk_store(struct kobject
>         if (!error)
>                 pm_pr_dbg("Hibernation mode set to '%s'\n",
>                                hibernation_modes[mode]);
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>         return error ? error : n;
>  }
>
> @@ -1158,9 +1165,10 @@ static ssize_t resume_show(struct kobjec
>  static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr,
>                             const char *buf, size_t n)
>  {
> -       dev_t res;
> +       unsigned int sleep_flags;
>         int len = n;
>         char *name;
> +       dev_t res;
>
>         if (len && buf[len-1] == '\n')
>                 len--;
> @@ -1173,9 +1181,10 @@ static ssize_t resume_store(struct kobje
>         if (!res)
>                 return -EINVAL;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>         swsusp_resume_device = res;
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
> +
>         pm_pr_dbg("Configured hibernation resume from disk to %u\n",
>                   swsusp_resume_device);
>         noresume = 0;
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -21,14 +21,16 @@
>
>  #ifdef CONFIG_PM_SLEEP
>
> -void lock_system_sleep(void)
> +unsigned int lock_system_sleep(void)
>  {
> +       unsigned int flags = current->flags;
>         current->flags |= PF_FREEZER_SKIP;
>         mutex_lock(&system_transition_mutex);
> +       return flags;
>  }
>  EXPORT_SYMBOL_GPL(lock_system_sleep);
>
> -void unlock_system_sleep(void)
> +void unlock_system_sleep(unsigned int flags)
>  {
>         /*
>          * Don't use freezer_count() because we don't want the call to
> @@ -46,7 +48,8 @@ void unlock_system_sleep(void)
>          * Which means, if we use try_to_freeze() here, it would make them
>          * enter the refrigerator, thus causing hibernation to lockup.
>          */
> -       current->flags &= ~PF_FREEZER_SKIP;
> +       if (!(flags & PF_FREEZER_SKIP))
> +               current->flags &= ~PF_FREEZER_SKIP;
>         mutex_unlock(&system_transition_mutex);
>  }
>  EXPORT_SYMBOL_GPL(unlock_system_sleep);
> @@ -263,16 +266,17 @@ static ssize_t pm_test_show(struct kobje
>  static ssize_t pm_test_store(struct kobject *kobj, struct kobj_attribute *attr,
>                                 const char *buf, size_t n)
>  {
> +       unsigned int sleep_flags;
>         const char * const *s;
> +       int error = -EINVAL;
>         int level;
>         char *p;
>         int len;
> -       int error = -EINVAL;
>
>         p = memchr(buf, '\n', n);
>         len = p ? p - buf : n;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         level = TEST_FIRST;
>         for (s = &pm_tests[level]; level <= TEST_MAX; s++, level++)
> @@ -282,7 +286,7 @@ static ssize_t pm_test_store(struct kobj
>                         break;
>                 }
>
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return error ? error : n;
>  }
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -75,9 +75,11 @@ EXPORT_SYMBOL_GPL(pm_suspend_default_s2i
>
>  void s2idle_set_ops(const struct platform_s2idle_ops *ops)
>  {
> -       lock_system_sleep();
> +       unsigned int sleep_flags;
> +
> +       sleep_flags = lock_system_sleep();
>         s2idle_ops = ops;
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>  }
>
>  static void s2idle_begin(void)
> @@ -200,7 +202,9 @@ __setup("mem_sleep_default=", mem_sleep_
>   */
>  void suspend_set_ops(const struct platform_suspend_ops *ops)
>  {
> -       lock_system_sleep();
> +       unsigned int sleep_flags;
> +
> +       sleep_flags = lock_system_sleep();
>
>         suspend_ops = ops;
>
> @@ -216,7 +220,7 @@ void suspend_set_ops(const struct platfo
>                         mem_sleep_current = PM_SUSPEND_MEM;
>         }
>
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>  }
>  EXPORT_SYMBOL_GPL(suspend_set_ops);
>
> --- a/kernel/power/user.c
> +++ b/kernel/power/user.c
> @@ -47,12 +47,13 @@ int is_hibernate_resume_dev(dev_t dev)
>  static int snapshot_open(struct inode *inode, struct file *filp)
>  {
>         struct snapshot_data *data;
> +       unsigned int sleep_flags;
>         int error;
>
>         if (!hibernation_available())
>                 return -EPERM;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         if (!hibernate_acquire()) {
>                 error = -EBUSY;
> @@ -98,7 +99,7 @@ static int snapshot_open(struct inode *i
>         data->dev = 0;
>
>   Unlock:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return error;
>  }
> @@ -106,8 +107,9 @@ static int snapshot_open(struct inode *i
>  static int snapshot_release(struct inode *inode, struct file *filp)
>  {
>         struct snapshot_data *data;
> +       unsigned int sleep_flags;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         swsusp_free();
>         data = filp->private_data;
> @@ -124,7 +126,7 @@ static int snapshot_release(struct inode
>                         PM_POST_HIBERNATION : PM_POST_RESTORE);
>         hibernate_release();
>
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return 0;
>  }
> @@ -132,11 +134,12 @@ static int snapshot_release(struct inode
>  static ssize_t snapshot_read(struct file *filp, char __user *buf,
>                               size_t count, loff_t *offp)
>  {
> +       loff_t pg_offp = *offp & ~PAGE_MASK;
>         struct snapshot_data *data;
> +       unsigned int sleep_flags;
>         ssize_t res;
> -       loff_t pg_offp = *offp & ~PAGE_MASK;
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         data = filp->private_data;
>         if (!data->ready) {
> @@ -157,7 +160,7 @@ static ssize_t snapshot_read(struct file
>                 *offp += res;
>
>   Unlock:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return res;
>  }
> @@ -165,16 +168,17 @@ static ssize_t snapshot_read(struct file
>  static ssize_t snapshot_write(struct file *filp, const char __user *buf,
>                                size_t count, loff_t *offp)
>  {
> +       loff_t pg_offp = *offp & ~PAGE_MASK;
>         struct snapshot_data *data;
> +       unsigned long sleep_flags;
>         ssize_t res;
> -       loff_t pg_offp = *offp & ~PAGE_MASK;
>
>         if (need_wait) {
>                 wait_for_device_probe();
>                 need_wait = false;
>         }
>
> -       lock_system_sleep();
> +       sleep_flags = lock_system_sleep();
>
>         data = filp->private_data;
>
> @@ -196,7 +200,7 @@ static ssize_t snapshot_write(struct fil
>         if (res > 0)
>                 *offp += res;
>  unlock:
> -       unlock_system_sleep();
> +       unlock_system_sleep(sleep_flags);
>
>         return res;
>  }
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction
  2022-08-22 11:18 ` [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
@ 2022-08-23 17:28   ` Rafael J. Wysocki
  0 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2022-08-23 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Eric Biederman,
	Sebastian Andrzej Siewior, Will Deacon,
	Linux Kernel Mailing List, Tejun Heo, Linux PM

On Mon, Aug 22, 2022 at 1:48 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure
> that the UMH, which is going to freeze the system, doesn't
> indefinitely wait for it's caller.
>
> Rework things by adding UMH_FREEZABLE to indicate the completion is
> freezable.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
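
The resulting caller-side idiom, as in the do_mounts_initrd.c hunk below
(a minimal sketch):

	/* the wait for the helper no longer blocks the freezer */
	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);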

> ---
>  include/linux/umh.h     |    9 +++++----
>  init/do_mounts_initrd.c |   10 +---------
>  kernel/umh.c            |    8 ++++++++
>  3 files changed, 14 insertions(+), 13 deletions(-)
>
> --- a/include/linux/umh.h
> +++ b/include/linux/umh.h
> @@ -11,10 +11,11 @@
>  struct cred;
>  struct file;
>
> -#define UMH_NO_WAIT    0       /* don't wait at all */
> -#define UMH_WAIT_EXEC  1       /* wait for the exec, but not the process */
> -#define UMH_WAIT_PROC  2       /* wait for the process to complete */
> -#define UMH_KILLABLE   4       /* wait for EXEC/PROC killable */
> +#define UMH_NO_WAIT    0x00    /* don't wait at all */
> +#define UMH_WAIT_EXEC  0x01    /* wait for the exec, but not the process */
> +#define UMH_WAIT_PROC  0x02    /* wait for the process to complete */
> +#define UMH_KILLABLE   0x04    /* wait for EXEC/PROC killable */
> +#define UMH_FREEZABLE  0x08    /* wait for EXEC/PROC freezable */
>
>  struct subprocess_info {
>         struct work_struct work;
> --- a/init/do_mounts_initrd.c
> +++ b/init/do_mounts_initrd.c
> @@ -79,19 +79,11 @@ static void __init handle_initrd(void)
>         init_mkdir("/old", 0700);
>         init_chdir("/old");
>
> -       /*
> -        * In case that a resume from disk is carried out by linuxrc or one of
> -        * its children, we need to tell the freezer not to wait for us.
> -        */
> -       current->flags |= PF_FREEZER_SKIP;
> -
>         info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
>                                          GFP_KERNEL, init_linuxrc, NULL, NULL);
>         if (!info)
>                 return;
> -       call_usermodehelper_exec(info, UMH_WAIT_PROC);
> -
> -       current->flags &= ~PF_FREEZER_SKIP;
> +       call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
>
>         /* move initrd to rootfs' /old */
>         init_mount("..", ".", NULL, MS_MOVE, NULL);
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -28,6 +28,7 @@
>  #include <linux/async.h>
>  #include <linux/uaccess.h>
>  #include <linux/initrd.h>
> +#include <linux/freezer.h>
>
>  #include <trace/events/module.h>
>
> @@ -436,6 +437,9 @@ int call_usermodehelper_exec(struct subp
>         if (wait == UMH_NO_WAIT)        /* task has freed sub_info */
>                 goto unlock;
>
> +       if (wait & UMH_FREEZABLE)
> +               freezer_do_not_count();
> +
>         if (wait & UMH_KILLABLE) {
>                 retval = wait_for_completion_killable(&done);
>                 if (!retval)
> @@ -448,6 +452,10 @@ int call_usermodehelper_exec(struct subp
>         }
>
>         wait_for_completion(&done);
> +
> +       if (wait & UMH_FREEZABLE)
> +               freezer_count();
> +
>  wait_done:
>         retval = sub_info->retval;
>  out:
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-08-22 11:18 ` [PATCH v3 4/6] sched/completion: Add wait_for_completion_state() Peter Zijlstra
@ 2022-08-23 17:32   ` Rafael J. Wysocki
  2022-08-26 21:54     ` Peter Zijlstra
  2022-09-04 10:46   ` Ingo Molnar
  1 sibling, 1 reply; 59+ messages in thread
From: Rafael J. Wysocki @ 2022-08-23 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Eric Biederman,
	Sebastian Andrzej Siewior, Will Deacon,
	Linux Kernel Mailing List, Tejun Heo, Linux PM

On Mon, Aug 22, 2022 at 1:48 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Allows waiting with a custom @state.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/completion.h |    1 +
>  kernel/sched/completion.c  |    9 +++++++++
>  2 files changed, 10 insertions(+)
>
> --- a/include/linux/completion.h
> +++ b/include/linux/completion.h
> @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
>  extern void wait_for_completion_io(struct completion *);
>  extern int wait_for_completion_interruptible(struct completion *x);
>  extern int wait_for_completion_killable(struct completion *x);
> +extern int wait_for_completion_state(struct completion *x, unsigned int state);
>  extern unsigned long wait_for_completion_timeout(struct completion *x,
>                                                    unsigned long timeout);
>  extern unsigned long wait_for_completion_io_timeout(struct completion *x,
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
>  }
>  EXPORT_SYMBOL(wait_for_completion_killable);
>
> +int __sched wait_for_completion_state(struct completion *x, unsigned int state)
> +{
> +       long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
> +       if (t == -ERESTARTSYS)
> +               return t;
> +       return 0;
> +}
> +EXPORT_SYMBOL(wait_for_completion_state);

Why not EXPORT_SYMBOL_GPL?  I guess to match the above?

> +
>  /**
>   * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
>   * @x:  holds the state of this particular completion
>
>
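
Usage as adopted later in the series, e.g. for the vfork wait (a minimal
sketch; the completion @done is assumed to be initialised elsewhere):

	int killed;

	killed = wait_for_completion_state(&done,
			TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE);
	if (killed) {
		/* woken by a fatal signal (-ERESTARTSYS) */
	}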

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
@ 2022-08-23 17:36   ` Rafael J. Wysocki
  2022-09-04 10:09   ` Ingo Molnar
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Rafael J. Wysocki @ 2022-08-23 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Oleg Nesterov, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Eric Biederman,
	Sebastian Andrzej Siewior, Will Deacon,
	Linux Kernel Mailing List, Tejun Heo, Linux PM

On Mon, Aug 22, 2022 at 1:48 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Rewrite the core freezer to behave better wrt thawing and be simpler
> in general.
>
> By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
> ensured that frozen tasks stay frozen until thawed and don't randomly wake
> up early, as is currently possible.
>
> As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
> two PF_flags (yay!).
>
> Specifically, the current scheme works a little like:
>
>         freezer_do_not_count();
>         schedule();
>         freezer_count();
>
> And either the task is blocked, or it lands in try_to_freeze()
> through freezer_count(). Now, when it is blocked, the freezer
> considers it frozen and continues.
>
> However, on thawing, once pm_freezing is cleared, freezer_count()
> stops working, and any random/spurious wakeup will let a task run
> before its time.
>
> That is, thawing tries to thaw things in an explicit order; kernel
> threads and workqueues before bringing SMP back, before userspace,
> etc. However, due to the above-mentioned races it is entirely possible
> for userspace tasks to thaw (by accident) before SMP is back.
>
> This can be a fatal problem in asymmetric ISA architectures (e.g. ARMv9)
> where the userspace task requires a special CPU to run.
>
> As said; replace this with a special task state TASK_FROZEN and add
> the following state transitions:
>
>         TASK_FREEZABLE  -> TASK_FROZEN
>         __TASK_STOPPED  -> TASK_FROZEN
>         __TASK_TRACED   -> TASK_FROZEN
>
> The new TASK_FREEZABLE can be set on any state that is part of TASK_NORMAL
> (IOW, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
> is already required to deal with spurious wakeups and the freezer
> causes one such when thawing the task (since the original state is
> lost).
>
> The special __TASK_{STOPPED,TRACED} states *can* be restored since
> their canonical state is in ->jobctl.
>
> With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
> free of undue (early / spurious) wakeups.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
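
With this in place, a freezable kthread's main loop reduces to tagging
its sleeps, as in the xfsaild hunk below (a minimal sketch; have_work()
and do_work() are placeholders, not real helpers):

	set_freezable();
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
		if (!have_work())	/* placeholder condition */
			schedule();
		__set_current_state(TASK_RUNNING);
		do_work();		/* placeholder */
	}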

> ---
>  drivers/android/binder.c       |    4
>  drivers/media/pci/pt3/pt3.c    |    4
>  fs/cifs/inode.c                |    4
>  fs/cifs/transport.c            |    5
>  fs/coredump.c                  |    5
>  fs/nfs/file.c                  |    3
>  fs/nfs/inode.c                 |   12 --
>  fs/nfs/nfs3proc.c              |    3
>  fs/nfs/nfs4proc.c              |   14 +-
>  fs/nfs/nfs4state.c             |    3
>  fs/nfs/pnfs.c                  |    4
>  fs/xfs/xfs_trans_ail.c         |    8 -
>  include/linux/freezer.h        |  245 +----------------------------------------
>  include/linux/sched.h          |   41 +++---
>  include/linux/sunrpc/sched.h   |    7 -
>  include/linux/wait.h           |   12 +-
>  kernel/cgroup/legacy_freezer.c |   23 +--
>  kernel/exit.c                  |    4
>  kernel/fork.c                  |    5
>  kernel/freezer.c               |  133 ++++++++++++++++------
>  kernel/futex/waitwake.c        |    8 -
>  kernel/hung_task.c             |    4
>  kernel/power/main.c            |    6 -
>  kernel/power/process.c         |   10 -
>  kernel/ptrace.c                |    2
>  kernel/sched/core.c            |    2
>  kernel/signal.c                |   14 +-
>  kernel/time/hrtimer.c          |    4
>  kernel/umh.c                   |   20 +--
>  mm/khugepaged.c                |    4
>  net/sunrpc/sched.c             |   12 --
>  net/unix/af_unix.c             |    8 -
>  32 files changed, 224 insertions(+), 409 deletions(-)
>
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -4247,10 +4247,9 @@ static int binder_wait_for_work(struct b
>         struct binder_proc *proc = thread->proc;
>         int ret = 0;
>
> -       freezer_do_not_count();
>         binder_inner_proc_lock(proc);
>         for (;;) {
> -               prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
> +               prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>                 if (binder_has_work_ilocked(thread, do_proc_work))
>                         break;
>                 if (do_proc_work)
> @@ -4267,7 +4266,6 @@ static int binder_wait_for_work(struct b
>         }
>         finish_wait(&thread->wait, &wait);
>         binder_inner_proc_unlock(proc);
> -       freezer_count();
>
>         return ret;
>  }
> --- a/drivers/media/pci/pt3/pt3.c
> +++ b/drivers/media/pci/pt3/pt3.c
> @@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
>                 pt3_proc_dma(adap);
>
>                 delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
> -               set_current_state(TASK_UNINTERRUPTIBLE);
> -               freezable_schedule_hrtimeout_range(&delay,
> +               set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
> +               schedule_hrtimeout_range(&delay,
>                                         PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
>                                         HRTIMER_MODE_REL);
>         }
> --- a/fs/cifs/inode.c
> +++ b/fs/cifs/inode.c
> @@ -2327,7 +2327,7 @@ cifs_invalidate_mapping(struct inode *in
>  static int
>  cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
>  {
> -       freezable_schedule_unsafe();
> +       schedule();
>         if (signal_pending_state(mode, current))
>                 return -ERESTARTSYS;
>         return 0;
> @@ -2345,7 +2345,7 @@ cifs_revalidate_mapping(struct inode *in
>                 return 0;
>
>         rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
> -                                    TASK_KILLABLE);
> +                                    TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>         if (rc)
>                 return rc;
>
> --- a/fs/cifs/transport.c
> +++ b/fs/cifs/transport.c
> @@ -757,8 +757,9 @@ wait_for_response(struct TCP_Server_Info
>  {
>         int error;
>
> -       error = wait_event_freezekillable_unsafe(server->response_q,
> -                                   midQ->mid_state != MID_REQUEST_SUBMITTED);
> +       error = wait_event_state(server->response_q,
> +                                midQ->mid_state != MID_REQUEST_SUBMITTED,
> +                                (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
>         if (error < 0)
>                 return -ERESTARTSYS;
>
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -402,9 +402,8 @@ static int coredump_wait(int exit_code,
>         if (core_waiters > 0) {
>                 struct core_thread *ptr;
>
> -               freezer_do_not_count();
> -               wait_for_completion(&core_state->startup);
> -               freezer_count();
> +               wait_for_completion_state(&core_state->startup,
> +                                         TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
>                 /*
>                  * Wait for all the threads to become inactive, so that
>                  * all the thread context (extended register state, like
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -570,7 +570,8 @@ static vm_fault_t nfs_vm_page_mkwrite(st
>         }
>
>         wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
> -                       nfs_wait_bit_killable, TASK_KILLABLE);
> +                          nfs_wait_bit_killable,
> +                          TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>
>         lock_page(page);
>         mapping = page_file_mapping(page);
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat
>         return nfs_fileid_to_ino_t(fattr->fileid);
>  }
>
> -static int nfs_wait_killable(int mode)
> +int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
>  {
> -       freezable_schedule_unsafe();
> +       schedule();
>         if (signal_pending_state(mode, current))
>                 return -ERESTARTSYS;
>         return 0;
>  }
> -
> -int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
> -{
> -       return nfs_wait_killable(mode);
> -}
>  EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
>
>  /**
> @@ -1331,7 +1326,8 @@ int nfs_clear_invalid_mapping(struct add
>          */
>         for (;;) {
>                 ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
> -                                        nfs_wait_bit_killable, TASK_KILLABLE);
> +                                        nfs_wait_bit_killable,
> +                                        TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>                 if (ret)
>                         goto out;
>                 spin_lock(&inode->i_lock);
> --- a/fs/nfs/nfs3proc.c
> +++ b/fs/nfs/nfs3proc.c
> @@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt,
>                 res = rpc_call_sync(clnt, msg, flags);
>                 if (res != -EJUKEBOX)
>                         break;
> -               freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
> +               __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
> +               schedule_timeout(NFS_JUKEBOX_RETRY_TIME);
>                 res = -ERESTARTSYS;
>         } while (!fatal_signal_pending(current));
>         return res;
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -416,8 +416,8 @@ static int nfs4_delay_killable(long *tim
>  {
>         might_sleep();
>
> -       freezable_schedule_timeout_killable_unsafe(
> -               nfs4_update_delay(timeout));
> +       __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
> +       schedule_timeout(nfs4_update_delay(timeout));
>         if (!__fatal_signal_pending(current))
>                 return 0;
>         return -EINTR;
> @@ -427,7 +427,8 @@ static int nfs4_delay_interruptible(long
>  {
>         might_sleep();
>
> -       freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
> +       __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
> +       schedule_timeout(nfs4_update_delay(timeout));
>         if (!signal_pending(current))
>                 return 0;
>         return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
> @@ -7406,7 +7407,8 @@ nfs4_retry_setlk_simple(struct nfs4_stat
>                 status = nfs4_proc_setlk(state, cmd, request);
>                 if ((status != -EAGAIN) || IS_SETLK(cmd))
>                         break;
> -               freezable_schedule_timeout_interruptible(timeout);
> +               __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> +               schedule_timeout(timeout);
>                 timeout *= 2;
>                 timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
>                 status = -ERESTARTSYS;
> @@ -7474,10 +7476,8 @@ nfs4_retry_setlk(struct nfs4_state *stat
>                         break;
>
>                 status = -ERESTARTSYS;
> -               freezer_do_not_count();
> -               wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
> +               wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
>                            NFS4_LOCK_MAXTIMEOUT);
> -               freezer_count();
>         } while (!signalled());
>
>         remove_wait_queue(q, &waiter.wait);
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -1314,7 +1314,8 @@ int nfs4_wait_clnt_recover(struct nfs_cl
>
>         refcount_inc(&clp->cl_count);
>         res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
> -                                nfs_wait_bit_killable, TASK_KILLABLE);
> +                                nfs_wait_bit_killable,
> +                                TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>         if (res)
>                 goto out;
>         if (clp->cl_cons_state < 0)
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1908,7 +1908,7 @@ static int pnfs_prepare_to_retry_layoutg
>         pnfs_layoutcommit_inode(lo->plh_inode, false);
>         return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
>                                    nfs_wait_bit_killable,
> -                                  TASK_KILLABLE);
> +                                  TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>  }
>
>  static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
> @@ -3193,7 +3193,7 @@ pnfs_layoutcommit_inode(struct inode *in
>                 status = wait_on_bit_lock_action(&nfsi->flags,
>                                 NFS_INO_LAYOUTCOMMITTING,
>                                 nfs_wait_bit_killable,
> -                               TASK_KILLABLE);
> +                               TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>                 if (status)
>                         goto out;
>         }
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -602,9 +602,9 @@ xfsaild(
>
>         while (1) {
>                 if (tout && tout <= 20)
> -                       set_current_state(TASK_KILLABLE);
> +                       set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
>                 else
> -                       set_current_state(TASK_INTERRUPTIBLE);
> +                       set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>
>                 /*
>                  * Check kthread_should_stop() after we set the task state to
> @@ -653,14 +653,14 @@ xfsaild(
>                     ailp->ail_target == ailp->ail_target_prev &&
>                     list_empty(&ailp->ail_buf_list)) {
>                         spin_unlock(&ailp->ail_lock);
> -                       freezable_schedule();
> +                       schedule();
>                         tout = 0;
>                         continue;
>                 }
>                 spin_unlock(&ailp->ail_lock);
>
>                 if (tout)
> -                       freezable_schedule_timeout(msecs_to_jiffies(tout));
> +                       schedule_timeout(msecs_to_jiffies(tout));
>
>                 __set_current_state(TASK_RUNNING);
>
> --- a/include/linux/freezer.h
> +++ b/include/linux/freezer.h
> @@ -8,9 +8,11 @@
>  #include <linux/sched.h>
>  #include <linux/wait.h>
>  #include <linux/atomic.h>
> +#include <linux/jump_label.h>
>
>  #ifdef CONFIG_FREEZER
> -extern atomic_t system_freezing_cnt;   /* nr of freezing conds in effect */
> +DECLARE_STATIC_KEY_FALSE(freezer_active);
> +
>  extern bool pm_freezing;               /* PM freezing in effect */
>  extern bool pm_nosig_freezing;         /* PM nosig freezing in effect */
>
> @@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs
>  /*
>   * Check if a process has been frozen
>   */
> -static inline bool frozen(struct task_struct *p)
> -{
> -       return p->flags & PF_FROZEN;
> -}
> +extern bool frozen(struct task_struct *p);
>
>  extern bool freezing_slow_path(struct task_struct *p);
>
> @@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct ta
>   */
>  static inline bool freezing(struct task_struct *p)
>  {
> -       if (likely(!atomic_read(&system_freezing_cnt)))
> -               return false;
> -       return freezing_slow_path(p);
> +       if (static_branch_unlikely(&freezer_active))
> +               return freezing_slow_path(p);
> +
> +       return false;
>  }
>
>  /* Takes and releases task alloc lock using task_lock() */
> @@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
>  extern void thaw_processes(void);
>  extern void thaw_kernel_threads(void);
>
> -/*
> - * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
> - * If try_to_freeze causes a lockdep warning it means the caller may deadlock
> - */
> -static inline bool try_to_freeze_unsafe(void)
> +static inline bool try_to_freeze(void)
>  {
>         might_sleep();
>         if (likely(!freezing(current)))
>                 return false;
> -       return __refrigerator(false);
> -}
> -
> -static inline bool try_to_freeze(void)
> -{
>         if (!(current->flags & PF_NOFREEZE))
>                 debug_check_no_locks_held();
> -       return try_to_freeze_unsafe();
> +       return __refrigerator(false);
>  }
>
>  extern bool freeze_task(struct task_struct *p);
> @@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struc
>  }
>  #endif /* !CONFIG_CGROUP_FREEZER */
>
> -/*
> - * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
> - * calls wait_for_completion(&vfork) and reset right after it returns from this
> - * function.  Next, the parent should call try_to_freeze() to freeze itself
> - * appropriately in case the child has exited before the freezing of tasks is
> - * complete.  However, we don't want kernel threads to be frozen in unexpected
> - * places, so we allow them to block freeze_processes() instead or to set
> - * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
> - * parent won't really block freeze_processes(), since ____call_usermodehelper()
> - * (the child) does a little before exec/exit and it can't be frozen before
> - * waking up the parent.
> - */
> -
> -
> -/**
> - * freezer_do_not_count - tell freezer to ignore %current
> - *
> - * Tell freezers to ignore the current task when determining whether the
> - * target frozen state is reached.  IOW, the current task will be
> - * considered frozen enough by freezers.
> - *
> - * The caller shouldn't do anything which isn't allowed for a frozen task
> - * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
> - * wrap a scheduling operation and nothing much else.
> - */
> -static inline void freezer_do_not_count(void)
> -{
> -       current->flags |= PF_FREEZER_SKIP;
> -}
> -
> -/**
> - * freezer_count - tell freezer to stop ignoring %current
> - *
> - * Undo freezer_do_not_count().  It tells freezers that %current should be
> - * considered again and tries to freeze if freezing condition is already in
> - * effect.
> - */
> -static inline void freezer_count(void)
> -{
> -       current->flags &= ~PF_FREEZER_SKIP;
> -       /*
> -        * If freezing is in progress, the following paired with smp_mb()
> -        * in freezer_should_skip() ensures that either we see %true
> -        * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
> -        */
> -       smp_mb();
> -       try_to_freeze();
> -}
> -
> -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
> -static inline void freezer_count_unsafe(void)
> -{
> -       current->flags &= ~PF_FREEZER_SKIP;
> -       smp_mb();
> -       try_to_freeze_unsafe();
> -}
> -
> -/**
> - * freezer_should_skip - whether to skip a task when determining frozen
> - *                      state is reached
> - * @p: task in quesion
> - *
> - * This function is used by freezers after establishing %true freezing() to
> - * test whether a task should be skipped when determining the target frozen
> - * state is reached.  IOW, if this function returns %true, @p is considered
> - * frozen enough.
> - */
> -static inline bool freezer_should_skip(struct task_struct *p)
> -{
> -       /*
> -        * The following smp_mb() paired with the one in freezer_count()
> -        * ensures that either freezer_count() sees %true freezing() or we
> -        * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
> -        * impossible for a task to slip frozen state testing after
> -        * clearing %PF_FREEZER_SKIP.
> -        */
> -       smp_mb();
> -       return p->flags & PF_FREEZER_SKIP;
> -}
> -
> -/*
> - * These functions are intended to be used whenever you want allow a sleeping
> - * task to be frozen. Note that neither return any clear indication of
> - * whether a freeze event happened while in this function.
> - */
> -
> -/* Like schedule(), but should not block the freezer. */
> -static inline void freezable_schedule(void)
> -{
> -       freezer_do_not_count();
> -       schedule();
> -       freezer_count();
> -}
> -
> -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
> -static inline void freezable_schedule_unsafe(void)
> -{
> -       freezer_do_not_count();
> -       schedule();
> -       freezer_count_unsafe();
> -}
> -
> -/*
> - * Like schedule_timeout(), but should not block the freezer.  Do not
> - * call this with locks held.
> - */
> -static inline long freezable_schedule_timeout(long timeout)
> -{
> -       long __retval;
> -       freezer_do_not_count();
> -       __retval = schedule_timeout(timeout);
> -       freezer_count();
> -       return __retval;
> -}
> -
> -/*
> - * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
> - * call this with locks held.
> - */
> -static inline long freezable_schedule_timeout_interruptible(long timeout)
> -{
> -       long __retval;
> -       freezer_do_not_count();
> -       __retval = schedule_timeout_interruptible(timeout);
> -       freezer_count();
> -       return __retval;
> -}
> -
> -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
> -static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
> -{
> -       long __retval;
> -
> -       freezer_do_not_count();
> -       __retval = schedule_timeout_interruptible(timeout);
> -       freezer_count_unsafe();
> -       return __retval;
> -}
> -
> -/* Like schedule_timeout_killable(), but should not block the freezer. */
> -static inline long freezable_schedule_timeout_killable(long timeout)
> -{
> -       long __retval;
> -       freezer_do_not_count();
> -       __retval = schedule_timeout_killable(timeout);
> -       freezer_count();
> -       return __retval;
> -}
> -
> -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
> -static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
> -{
> -       long __retval;
> -       freezer_do_not_count();
> -       __retval = schedule_timeout_killable(timeout);
> -       freezer_count_unsafe();
> -       return __retval;
> -}
> -
> -/*
> - * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
> - * call this with locks held.
> - */
> -static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
> -               u64 delta, const enum hrtimer_mode mode)
> -{
> -       int __retval;
> -       freezer_do_not_count();
> -       __retval = schedule_hrtimeout_range(expires, delta, mode);
> -       freezer_count();
> -       return __retval;
> -}
> -
> -/*
> - * Freezer-friendly wrappers around wait_event_interruptible(),
> - * wait_event_killable() and wait_event_interruptible_timeout(), originally
> - * defined in <linux/wait.h>
> - */
> -
> -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
> -#define wait_event_freezekillable_unsafe(wq, condition)                        \
> -({                                                                     \
> -       int __retval;                                                   \
> -       freezer_do_not_count();                                         \
> -       __retval = wait_event_killable(wq, (condition));                \
> -       freezer_count_unsafe();                                         \
> -       __retval;                                                       \
> -})
> -
>  #else /* !CONFIG_FREEZER */
>  static inline bool frozen(struct task_struct *p) { return false; }
>  static inline bool freezing(struct task_struct *p) { return false; }
> @@ -281,35 +83,8 @@ static inline void thaw_kernel_threads(v
>
>  static inline bool try_to_freeze(void) { return false; }
>
> -static inline void freezer_do_not_count(void) {}
> -static inline void freezer_count(void) {}
> -static inline int freezer_should_skip(struct task_struct *p) { return 0; }
>  static inline void set_freezable(void) {}
>
> -#define freezable_schedule()  schedule()
> -
> -#define freezable_schedule_unsafe()  schedule()
> -
> -#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
> -
> -#define freezable_schedule_timeout_interruptible(timeout)              \
> -       schedule_timeout_interruptible(timeout)
> -
> -#define freezable_schedule_timeout_interruptible_unsafe(timeout)       \
> -       schedule_timeout_interruptible(timeout)
> -
> -#define freezable_schedule_timeout_killable(timeout)                   \
> -       schedule_timeout_killable(timeout)
> -
> -#define freezable_schedule_timeout_killable_unsafe(timeout)            \
> -       schedule_timeout_killable(timeout)
> -
> -#define freezable_schedule_hrtimeout_range(expires, delta, mode)       \
> -       schedule_hrtimeout_range(expires, delta, mode)
> -
> -#define wait_event_freezekillable_unsafe(wq, condition)                        \
> -               wait_event_killable(wq, condition)
> -
>  #endif /* !CONFIG_FREEZER */
>
>  #endif /* FREEZER_H_INCLUDED */
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -81,25 +81,32 @@ struct task_group;
>   */
>
>  /* Used in tsk->state: */
> -#define TASK_RUNNING                   0x0000
> -#define TASK_INTERRUPTIBLE             0x0001
> -#define TASK_UNINTERRUPTIBLE           0x0002
> -#define __TASK_STOPPED                 0x0004
> -#define __TASK_TRACED                  0x0008
> +#define TASK_RUNNING                   0x000000
> +#define TASK_INTERRUPTIBLE             0x000001
> +#define TASK_UNINTERRUPTIBLE           0x000002
> +#define __TASK_STOPPED                 0x000004
> +#define __TASK_TRACED                  0x000008
>  /* Used in tsk->exit_state: */
> -#define EXIT_DEAD                      0x0010
> -#define EXIT_ZOMBIE                    0x0020
> +#define EXIT_DEAD                      0x000010
> +#define EXIT_ZOMBIE                    0x000020
>  #define EXIT_TRACE                     (EXIT_ZOMBIE | EXIT_DEAD)
>  /* Used in tsk->state again: */
> -#define TASK_PARKED                    0x0040
> -#define TASK_DEAD                      0x0080
> -#define TASK_WAKEKILL                  0x0100
> -#define TASK_WAKING                    0x0200
> -#define TASK_NOLOAD                    0x0400
> -#define TASK_NEW                       0x0800
> -/* RT specific auxilliary flag to mark RT lock waiters */
> -#define TASK_RTLOCK_WAIT               0x1000
> -#define TASK_STATE_MAX                 0x2000
> +#define TASK_PARKED                    0x000040
> +#define TASK_DEAD                      0x000080
> +#define TASK_WAKEKILL                  0x000100
> +#define TASK_WAKING                    0x000200
> +#define TASK_NOLOAD                    0x000400
> +#define TASK_NEW                       0x000800
> +#define TASK_FREEZABLE                 0x001000
> +#define __TASK_FREEZABLE_UNSAFE               (0x002000 * IS_ENABLED(CONFIG_LOCKDEP))
> +#define TASK_FROZEN                    0x004000
> +#define TASK_RTLOCK_WAIT               0x008000
> +#define TASK_STATE_MAX                 0x010000
> +
> +/*
> + * DO NOT ADD ANY NEW USERS !
> + */
> +#define TASK_FREEZABLE_UNSAFE          (TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
>
>  /* Convenience macros for the sake of set_current_state: */
>  #define TASK_KILLABLE                  (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
> @@ -1714,7 +1721,6 @@ extern struct pid *cad_pid;
>  #define PF_NPROC_EXCEEDED      0x00001000      /* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH           0x00002000      /* If unset the fpu must be initialized before use */
>  #define PF_NOFREEZE            0x00008000      /* This thread should not be frozen */
> -#define PF_FROZEN              0x00010000      /* Frozen for system suspend */
>  #define PF_KSWAPD              0x00020000      /* I am kswapd */
>  #define PF_MEMALLOC_NOFS       0x00040000      /* All allocation requests will inherit GFP_NOFS */
>  #define PF_MEMALLOC_NOIO       0x00080000      /* All allocation requests will inherit GFP_NOIO */
> @@ -1725,7 +1731,6 @@ extern struct pid *cad_pid;
>  #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
>  #define PF_MCE_EARLY           0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMALLOC_PIN                0x10000000      /* Allocation context constrained to zones which allow long term pinning. */
> -#define PF_FREEZER_SKIP                0x40000000      /* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK                0x80000000      /* This thread called freeze_processes() and should not be frozen */
>
>  /*
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -252,7 +252,7 @@ int         rpc_malloc(struct rpc_task *);
>  void           rpc_free(struct rpc_task *);
>  int            rpciod_up(void);
>  void           rpciod_down(void);
> -int            __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
> +int            rpc_wait_for_completion_task(struct rpc_task *task);
>  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>  struct net;
>  void           rpc_show_tasks(struct net *);
> @@ -264,11 +264,6 @@ extern struct workqueue_struct *xprtiod_
>  void           rpc_prepare_task(struct rpc_task *task);
>  gfp_t          rpc_task_gfp_mask(void);
>
> -static inline int rpc_wait_for_completion_task(struct rpc_task *task)
> -{
> -       return __rpc_wait_for_completion_task(task, NULL);
> -}
> -
>  #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
>  static inline const char * rpc_qname(const struct rpc_wait_queue *q)
>  {
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -361,8 +361,8 @@ do {                                                                                \
>  } while (0)
>
>  #define __wait_event_freezable(wq_head, condition)                             \
> -       ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,             \
> -                           freezable_schedule())
> +       ___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),  \
> +                       0, 0, schedule())
>
>  /**
>   * wait_event_freezable - sleep (or freeze) until a condition gets true
> @@ -420,8 +420,8 @@ do {                                                                                \
>
>  #define __wait_event_freezable_timeout(wq_head, condition, timeout)            \
>         ___wait_event(wq_head, ___wait_cond_timeout(condition),                 \
> -                     TASK_INTERRUPTIBLE, 0, timeout,                           \
> -                     __ret = freezable_schedule_timeout(__ret))
> +                     (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,          \
> +                     __ret = schedule_timeout(__ret))
>
>  /*
>   * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
> @@ -642,8 +642,8 @@ do {                                                                                \
>
>
>  #define __wait_event_freezable_exclusive(wq, condition)                                \
> -       ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,                  \
> -                       freezable_schedule())
> +       ___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
> +                       schedule())
>
>  #define wait_event_freezable_exclusive(wq, condition)                          \
>  ({                                                                             \
> --- a/kernel/cgroup/legacy_freezer.c
> +++ b/kernel/cgroup/legacy_freezer.c
> @@ -113,7 +113,7 @@ static int freezer_css_online(struct cgr
>
>         if (parent && (parent->state & CGROUP_FREEZING)) {
>                 freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
> -               atomic_inc(&system_freezing_cnt);
> +               static_branch_inc(&freezer_active);
>         }
>
>         mutex_unlock(&freezer_mutex);
> @@ -134,7 +134,7 @@ static void freezer_css_offline(struct c
>         mutex_lock(&freezer_mutex);
>
>         if (freezer->state & CGROUP_FREEZING)
> -               atomic_dec(&system_freezing_cnt);
> +               static_branch_dec(&freezer_active);
>
>         freezer->state = 0;
>
> @@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup
>                         __thaw_task(task);
>                 } else {
>                         freeze_task(task);
> +
>                         /* clear FROZEN and propagate upwards */
>                         while (freezer && (freezer->state & CGROUP_FROZEN)) {
>                                 freezer->state &= ~CGROUP_FROZEN;
> @@ -271,16 +272,8 @@ static void update_if_frozen(struct cgro
>         css_task_iter_start(css, 0, &it);
>
>         while ((task = css_task_iter_next(&it))) {
> -               if (freezing(task)) {
> -                       /*
> -                        * freezer_should_skip() indicates that the task
> -                        * should be skipped when determining freezing
> -                        * completion.  Consider it frozen in addition to
> -                        * the usual frozen condition.
> -                        */
> -                       if (!frozen(task) && !freezer_should_skip(task))
> -                               goto out_iter_end;
> -               }
> +               if (freezing(task) && !frozen(task))
> +                       goto out_iter_end;
>         }
>
>         freezer->state |= CGROUP_FROZEN;
> @@ -357,7 +350,7 @@ static void freezer_apply_state(struct f
>
>         if (freeze) {
>                 if (!(freezer->state & CGROUP_FREEZING))
> -                       atomic_inc(&system_freezing_cnt);
> +                       static_branch_inc(&freezer_active);
>                 freezer->state |= state;
>                 freeze_cgroup(freezer);
>         } else {
> @@ -366,9 +359,9 @@ static void freezer_apply_state(struct f
>                 freezer->state &= ~state;
>
>                 if (!(freezer->state & CGROUP_FREEZING)) {
> -                       if (was_freezing)
> -                               atomic_dec(&system_freezing_cnt);
>                         freezer->state &= ~CGROUP_FROZEN;
> +                       if (was_freezing)
> +                               static_branch_dec(&freezer_active);
>                         unfreeze_cgroup(freezer);
>                 }
>         }
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -374,10 +374,10 @@ static void coredump_task_exit(struct ta
>                         complete(&core_state->startup);
>
>                 for (;;) {
> -                       set_current_state(TASK_UNINTERRUPTIBLE);
> +                       set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
>                         if (!self.task) /* see coredump_finish() */
>                                 break;
> -                       freezable_schedule();
> +                       schedule();
>                 }
>                 __set_current_state(TASK_RUNNING);
>         }
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1420,13 +1420,12 @@ static void complete_vfork_done(struct t
>  static int wait_for_vfork_done(struct task_struct *child,
>                                 struct completion *vfork)
>  {
> +       unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE;
>         int killed;
>
> -       freezer_do_not_count();
>         cgroup_enter_frozen();
> -       killed = wait_for_completion_killable(vfork);
> +       killed = wait_for_completion_state(vfork, state);
>         cgroup_leave_frozen(false);
> -       freezer_count();
>
>         if (killed) {
>                 task_lock(child);
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -13,10 +13,11 @@
>  #include <linux/kthread.h>
>
>  /* total number of freezing conditions in effect */
> -atomic_t system_freezing_cnt = ATOMIC_INIT(0);
> -EXPORT_SYMBOL(system_freezing_cnt);
> +DEFINE_STATIC_KEY_FALSE(freezer_active);
> +EXPORT_SYMBOL(freezer_active);
>
> -/* indicate whether PM freezing is in effect, protected by
> +/*
> + * indicate whether PM freezing is in effect, protected by
>   * system_transition_mutex
>   */
>  bool pm_freezing;
> @@ -29,7 +30,7 @@ static DEFINE_SPINLOCK(freezer_lock);
>   * freezing_slow_path - slow path for testing whether a task needs to be frozen
>   * @p: task to be tested
>   *
> - * This function is called by freezing() if system_freezing_cnt isn't zero
> + * This function is called by freezing() if freezer_active isn't zero
>   * and tests whether @p needs to enter and stay in frozen state.  Can be
>   * called under any context.  The freezers are responsible for ensuring the
>   * target tasks see the updated state.
> @@ -52,41 +53,40 @@ bool freezing_slow_path(struct task_stru
>  }
>  EXPORT_SYMBOL(freezing_slow_path);
>
> +bool frozen(struct task_struct *p)
> +{
> +       return READ_ONCE(p->__state) & TASK_FROZEN;
> +}
> +
>  /* Refrigerator is place where frozen processes are stored :-). */
>  bool __refrigerator(bool check_kthr_stop)
>  {
> -       /* Hmm, should we be allowed to suspend when there are realtime
> -          processes around? */
> +       unsigned int state = get_current_state();
>         bool was_frozen = false;
> -       unsigned int save = get_current_state();
>
>         pr_debug("%s entered refrigerator\n", current->comm);
>
> +       WARN_ON_ONCE(state && !(state & TASK_NORMAL));
> +
>         for (;;) {
> -               set_current_state(TASK_UNINTERRUPTIBLE);
> +               bool freeze;
> +
> +               set_current_state(TASK_FROZEN);
>
>                 spin_lock_irq(&freezer_lock);
> -               current->flags |= PF_FROZEN;
> -               if (!freezing(current) ||
> -                   (check_kthr_stop && kthread_should_stop()))
> -                       current->flags &= ~PF_FROZEN;
> +               freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
>                 spin_unlock_irq(&freezer_lock);
>
> -               if (!(current->flags & PF_FROZEN))
> +               if (!freeze)
>                         break;
> +
>                 was_frozen = true;
>                 schedule();
>         }
> +       __set_current_state(TASK_RUNNING);
>
>         pr_debug("%s left refrigerator\n", current->comm);
>
> -       /*
> -        * Restore saved task state before returning.  The mb'd version
> -        * needs to be used; otherwise, it might silently break
> -        * synchronization which depends on ordered task state change.
> -        */
> -       set_current_state(save);
> -
>         return was_frozen;
>  }
>  EXPORT_SYMBOL(__refrigerator);
> @@ -101,6 +101,44 @@ static void fake_signal_wake_up(struct t
>         }
>  }
>
> +static int __set_task_frozen(struct task_struct *p, void *arg)
> +{
> +       unsigned int state = READ_ONCE(p->__state);
> +
> +       if (p->on_rq)
> +               return 0;
> +
> +       if (p != current && task_curr(p))
> +               return 0;
> +
> +       if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED)))
> +               return 0;
> +
> +       /*
> +        * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they
> +        * can suffer spurious wakeups.
> +        */
> +       if (state & TASK_FREEZABLE)
> +               WARN_ON_ONCE(!(state & TASK_NORMAL));
> +
> +#ifdef CONFIG_LOCKDEP
> +       /*
> +        * It's dangerous to freeze with locks held; there be dragons there.
> +        */
> +       if (!(state & __TASK_FREEZABLE_UNSAFE))
> +               WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> +#endif
> +
> +       WRITE_ONCE(p->__state, TASK_FROZEN);
> +       return TASK_FROZEN;
> +}
> +
> +static bool __freeze_task(struct task_struct *p)
> +{
> +       /* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */
> +       return task_call_func(p, __set_task_frozen, NULL);
> +}
> +
>  /**
>   * freeze_task - send a freeze request to given task
>   * @p: task to send the request to
> @@ -116,20 +154,8 @@ bool freeze_task(struct task_struct *p)
>  {
>         unsigned long flags;
>
> -       /*
> -        * This check can race with freezer_do_not_count, but worst case that
> -        * will result in an extra wakeup being sent to the task.  It does not
> -        * race with freezer_count(), the barriers in freezer_count() and
> -        * freezer_should_skip() ensure that either freezer_count() sees
> -        * freezing == true in try_to_freeze() and freezes, or
> -        * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
> -        * normally.
> -        */
> -       if (freezer_should_skip(p))
> -               return false;
> -
>         spin_lock_irqsave(&freezer_lock, flags);
> -       if (!freezing(p) || frozen(p)) {
> +       if (!freezing(p) || frozen(p) || __freeze_task(p)) {
>                 spin_unlock_irqrestore(&freezer_lock, flags);
>                 return false;
>         }
> @@ -137,19 +163,52 @@ bool freeze_task(struct task_struct *p)
>         if (!(p->flags & PF_KTHREAD))
>                 fake_signal_wake_up(p);
>         else
> -               wake_up_state(p, TASK_INTERRUPTIBLE);
> +               wake_up_state(p, TASK_NORMAL);
>
>         spin_unlock_irqrestore(&freezer_lock, flags);
>         return true;
>  }
>
> +/*
> + * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical
> + * state in p->jobctl. If either of them got a wakeup that was missed because
> + * TASK_FROZEN, then their canonical state reflects that and the below will
> + * refuse to restore the special state and instead issue the wakeup.
> + */
> +static int __set_task_special(struct task_struct *p, void *arg)
> +{
> +       unsigned int state = 0;
> +
> +       if (p->jobctl & JOBCTL_TRACED)
> +               state = TASK_TRACED;
> +
> +       else if (p->jobctl & JOBCTL_STOPPED)
> +               state = TASK_STOPPED;
> +
> +       if (state)
> +               WRITE_ONCE(p->__state, state);
> +
> +       return state;
> +}
> +
>  void __thaw_task(struct task_struct *p)
>  {
> -       unsigned long flags;
> +       unsigned long flags, flags2;
>
>         spin_lock_irqsave(&freezer_lock, flags);
> -       if (frozen(p))
> -               wake_up_process(p);
> +       if (WARN_ON_ONCE(freezing(p)))
> +               goto unlock;
> +
> +       if (lock_task_sighand(p, &flags2)) {
> +               /* TASK_FROZEN -> TASK_{STOPPED,TRACED} */
> +               bool ret = task_call_func(p, __set_task_special, NULL);
> +               unlock_task_sighand(p, &flags2);
> +               if (ret)
> +                       goto unlock;
> +       }
> +
> +       wake_up_state(p, TASK_FROZEN);
> +unlock:
>         spin_unlock_irqrestore(&freezer_lock, flags);
>  }
>
> --- a/kernel/futex/waitwake.c
> +++ b/kernel/futex/waitwake.c
> @@ -334,7 +334,7 @@ void futex_wait_queue(struct futex_hash_
>          * futex_queue() calls spin_unlock() upon completion, both serializing
>          * access to the hash list and forcing another memory barrier.
>          */
> -       set_current_state(TASK_INTERRUPTIBLE);
> +       set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>         futex_queue(q, hb);
>
>         /* Arm the timer */
> @@ -352,7 +352,7 @@ void futex_wait_queue(struct futex_hash_
>                  * is no timeout, or if it has yet to expire.
>                  */
>                 if (!timeout || timeout->task)
> -                       freezable_schedule();
> +                       schedule();
>         }
>         __set_current_state(TASK_RUNNING);
>  }
> @@ -430,7 +430,7 @@ static int futex_wait_multiple_setup(str
>                         return ret;
>         }
>
> -       set_current_state(TASK_INTERRUPTIBLE);
> +       set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>
>         for (i = 0; i < count; i++) {
>                 u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
> @@ -504,7 +504,7 @@ static void futex_sleep_multiple(struct
>                         return;
>         }
>
> -       freezable_schedule();
> +       schedule();
>  }
>
>  /**
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -95,8 +95,8 @@ static void check_hung_task(struct task_
>          * Ensure the task is not frozen.
>          * Also, skip vfork and any other user process that freezer should skip.
>          */
> -       if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
> -           return;
> +       if (unlikely(READ_ONCE(t->__state) & (TASK_FREEZABLE | TASK_FROZEN)))
> +               return;
>
>         /*
>          * When a freshly created task is scheduled once, changes its state to
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -24,7 +24,7 @@
>  unsigned int lock_system_sleep(void)
>  {
>         unsigned int flags = current->flags;
> -       current->flags |= PF_FREEZER_SKIP;
> +       current->flags |= PF_NOFREEZE;
>         mutex_lock(&system_transition_mutex);
>         return flags;
>  }
> @@ -48,8 +48,8 @@ void unlock_system_sleep(unsigned int fl
>          * Which means, if we use try_to_freeze() here, it would make them
>          * enter the refrigerator, thus causing hibernation to lockup.
>          */
> -       if (!(flags & PF_FREEZER_SKIP))
> -               current->flags &= ~PF_FREEZER_SKIP;
> +       if (!(flags & PF_NOFREEZE))
> +               current->flags &= ~PF_NOFREEZE;
>         mutex_unlock(&system_transition_mutex);
>  }
>  EXPORT_SYMBOL_GPL(unlock_system_sleep);
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -50,8 +50,7 @@ static int try_to_freeze_tasks(bool user
>                         if (p == current || !freeze_task(p))
>                                 continue;
>
> -                       if (!freezer_should_skip(p))
> -                               todo++;
> +                       todo++;
>                 }
>                 read_unlock(&tasklist_lock);
>
> @@ -96,8 +95,7 @@ static int try_to_freeze_tasks(bool user
>                 if (!wakeup || pm_debug_messages_on) {
>                         read_lock(&tasklist_lock);
>                         for_each_process_thread(g, p) {
> -                               if (p != current && !freezer_should_skip(p)
> -                                   && freezing(p) && !frozen(p))
> +                               if (p != current && freezing(p) && !frozen(p))
>                                         sched_show_task(p);
>                         }
>                         read_unlock(&tasklist_lock);
> @@ -129,7 +127,7 @@ int freeze_processes(void)
>         current->flags |= PF_SUSPEND_TASK;
>
>         if (!pm_freezing)
> -               atomic_inc(&system_freezing_cnt);
> +               static_branch_inc(&freezer_active);
>
>         pm_wakeup_clear(0);
>         pr_info("Freezing user space processes ... ");
> @@ -190,7 +188,7 @@ void thaw_processes(void)
>
>         trace_suspend_resume(TPS("thaw_processes"), 0, true);
>         if (pm_freezing)
> -               atomic_dec(&system_freezing_cnt);
> +               static_branch_dec(&freezer_active);
>         pm_freezing = false;
>         pm_nosig_freezing = false;
>
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -269,7 +269,7 @@ static int ptrace_check_attach(struct ta
>         read_unlock(&tasklist_lock);
>
>         if (!ret && !ignore_state &&
> -           WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED)))
> +           WARN_ON_ONCE(!wait_task_inactive(child, __TASK_TRACED|TASK_FROZEN)))
>                 ret = -ESRCH;
>
>         return ret;
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6429,7 +6429,7 @@ static void __sched notrace __schedule(u
>                         prev->sched_contributes_to_load =
>                                 (prev_state & TASK_UNINTERRUPTIBLE) &&
>                                 !(prev_state & TASK_NOLOAD) &&
> -                               !(prev->flags & PF_FROZEN);
> +                               !(prev_state & TASK_FROZEN);
>
>                         if (prev->sched_contributes_to_load)
>                                 rq->nr_uninterruptible++;
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2304,7 +2304,7 @@ static int ptrace_stop(int exit_code, in
>         read_unlock(&tasklist_lock);
>         cgroup_enter_frozen();
>         preempt_enable_no_resched();
> -       freezable_schedule();
> +       schedule();
>         cgroup_leave_frozen(true);
>
>         /*
> @@ -2473,7 +2473,7 @@ static bool do_signal_stop(int signr)
>
>                 /* Now we don't run again until woken by SIGCONT or SIGKILL */
>                 cgroup_enter_frozen();
> -               freezable_schedule();
> +               schedule();
>                 return true;
>         } else {
>                 /*
> @@ -2548,11 +2548,11 @@ static void do_freezer_trap(void)
>          * immediately (if there is a non-fatal signal pending), and
>          * put the task into sleep.
>          */
> -       __set_current_state(TASK_INTERRUPTIBLE);
> +       __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>         clear_thread_flag(TIF_SIGPENDING);
>         spin_unlock_irq(&current->sighand->siglock);
>         cgroup_enter_frozen();
> -       freezable_schedule();
> +       schedule();
>  }
>
>  static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
> @@ -3600,9 +3600,9 @@ static int do_sigtimedwait(const sigset_
>                 recalc_sigpending();
>                 spin_unlock_irq(&tsk->sighand->siglock);
>
> -               __set_current_state(TASK_INTERRUPTIBLE);
> -               ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
> -                                                        HRTIMER_MODE_REL);
> +               __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> +               ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
> +                                              HRTIMER_MODE_REL);
>                 spin_lock_irq(&tsk->sighand->siglock);
>                 __set_task_blocked(tsk, &tsk->real_blocked);
>                 sigemptyset(&tsk->real_blocked);
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -2037,11 +2037,11 @@ static int __sched do_nanosleep(struct h
>         struct restart_block *restart;
>
>         do {
> -               set_current_state(TASK_INTERRUPTIBLE);
> +               set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
>                 hrtimer_sleeper_start_expires(t, mode);
>
>                 if (likely(t->task))
> -                       freezable_schedule();
> +                       schedule();
>
>                 hrtimer_cancel(&t->timer);
>                 mode = HRTIMER_MODE_ABS;
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -404,6 +404,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup)
>   */
>  int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
>  {
> +       unsigned int state = TASK_UNINTERRUPTIBLE;
>         DECLARE_COMPLETION_ONSTACK(done);
>         int retval = 0;
>
> @@ -437,25 +438,22 @@ int call_usermodehelper_exec(struct subp
>         if (wait == UMH_NO_WAIT)        /* task has freed sub_info */
>                 goto unlock;
>
> +       if (wait & UMH_KILLABLE)
> +               state |= TASK_KILLABLE;
> +
>         if (wait & UMH_FREEZABLE)
> -               freezer_do_not_count();
> +               state |= TASK_FREEZABLE;
>
> -       if (wait & UMH_KILLABLE) {
> -               retval = wait_for_completion_killable(&done);
> -               if (!retval)
> -                       goto wait_done;
> +       retval = wait_for_completion_state(&done, state);
> +       if (!retval)
> +               goto wait_done;
>
> +       if (wait & UMH_KILLABLE) {
>                 /* umh_complete() will see NULL and free sub_info */
>                 if (xchg(&sub_info->complete, NULL))
>                         goto unlock;
> -               /* fallthrough, umh_complete() was already called */
>         }
>
> -       wait_for_completion(&done);
> -
> -       if (wait & UMH_FREEZABLE)
> -               freezer_count();
> -
>  wait_done:
>         retval = sub_info->retval;
>  out:
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -730,8 +730,8 @@ static void khugepaged_alloc_sleep(void)
>         DEFINE_WAIT(wait);
>
>         add_wait_queue(&khugepaged_wait, &wait);
> -       freezable_schedule_timeout_interruptible(
> -               msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
> +       __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> +       schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
>         remove_wait_queue(&khugepaged_wait, &wait);
>  }
>
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -269,7 +269,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue
>
>  static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
>  {
> -       freezable_schedule_unsafe();
> +       schedule();
>         if (signal_pending_state(mode, current))
>                 return -ERESTARTSYS;
>         return 0;
> @@ -333,14 +333,12 @@ static int rpc_complete_task(struct rpc_
>   * to enforce taking of the wq->lock and hence avoid races with
>   * rpc_complete_task().
>   */
> -int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
> +int rpc_wait_for_completion_task(struct rpc_task *task)
>  {
> -       if (action == NULL)
> -               action = rpc_wait_bit_killable;
>         return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> -                       action, TASK_KILLABLE);
> +                       rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
>  }
> -EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
> +EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
>
>  /*
>   * Make an RPC task runnable.
> @@ -964,7 +962,7 @@ static void __rpc_execute(struct rpc_tas
>                 trace_rpc_task_sync_sleep(task, task->tk_action);
>                 status = out_of_line_wait_on_bit(&task->tk_runstate,
>                                 RPC_TASK_QUEUED, rpc_wait_bit_killable,
> -                               TASK_KILLABLE);
> +                               TASK_KILLABLE|TASK_FREEZABLE);
>                 if (status < 0) {
>                         /*
>                          * When a sync task receives a signal, it exits with
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2543,13 +2543,14 @@ static long unix_stream_data_wait(struct
>                                   struct sk_buff *last, unsigned int last_len,
>                                   bool freezable)
>  {
> +       unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
>         struct sk_buff *tail;
>         DEFINE_WAIT(wait);
>
>         unix_state_lock(sk);
>
>         for (;;) {
> -               prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +               prepare_to_wait(sk_sleep(sk), &wait, state);
>
>                 tail = skb_peek_tail(&sk->sk_receive_queue);
>                 if (tail != last ||
> @@ -2562,10 +2563,7 @@ static long unix_stream_data_wait(struct
>
>                 sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
>                 unix_state_unlock(sk);
> -               if (freezable)
> -                       timeo = freezable_schedule_timeout(timeo);
> -               else
> -                       timeo = schedule_timeout(timeo);
> +               timeo = schedule_timeout(timeo);
>                 unix_state_lock(sk);
>
>                 if (sock_flag(sk, SOCK_DEAD))
>
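
The caller-side change, distilled from the hunks above (a sketch of the
pattern, not an extra hunk; "condition" stands in for the caller's wait
condition):

	/* Before: wrap the sleep in freezer accounting. */
	freezer_do_not_count();
	schedule();
	freezer_count();

	/* After: fold freezability into the sleep state itself. */
	set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
	if (!condition)
		schedule();
	__set_current_state(TASK_RUNNING);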
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-08-23 17:32   ` Rafael J. Wysocki
@ 2022-08-26 21:54     ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-08-26 21:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Oleg Nesterov, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Eric Biederman,
	Sebastian Andrzej Siewior, Will Deacon,
	Linux Kernel Mailing List, Tejun Heo, Linux PM

On Tue, Aug 23, 2022 at 07:32:33PM +0200, Rafael J. Wysocki wrote:
> On Mon, Aug 22, 2022 at 1:48 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Allows waiting with a custom @state.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/completion.h |    1 +
> >  kernel/sched/completion.c  |    9 +++++++++
> >  2 files changed, 10 insertions(+)
> >
> > --- a/include/linux/completion.h
> > +++ b/include/linux/completion.h
> > @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
> >  extern void wait_for_completion_io(struct completion *);
> >  extern int wait_for_completion_interruptible(struct completion *x);
> >  extern int wait_for_completion_killable(struct completion *x);
> > +extern int wait_for_completion_state(struct completion *x, unsigned int state);
> >  extern unsigned long wait_for_completion_timeout(struct completion *x,
> >                                                    unsigned long timeout);
> >  extern unsigned long wait_for_completion_io_timeout(struct completion *x,
> > --- a/kernel/sched/completion.c
> > +++ b/kernel/sched/completion.c
> > @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
> >  }
> >  EXPORT_SYMBOL(wait_for_completion_killable);
> >
> > +int __sched wait_for_completion_state(struct completion *x, unsigned int state)
> > +{
> > +       long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
> > +       if (t == -ERESTARTSYS)
> > +               return t;
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL(wait_for_completion_state);
> 
> Why not EXPORT_SYMBOL_GPL?  I guess to match the above?

Yeah; I'm torn between preference and consistency here :-)
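
For reference, what the custom @state buys callers (a hypothetical usage
sketch; &done assumed declared, state combination as in
wait_for_vfork_done() from patch 6/6):

	/* Freezable, killable wait; replaces the old
	 * freezer_do_not_count() + wait_for_completion_killable() pair.
	 */
	int ret = wait_for_completion_state(&done,
			TASK_UNINTERRUPTIBLE | TASK_KILLABLE | TASK_FREEZABLE);
	if (ret)	/* -ERESTARTSYS: fatal signal while waiting */
		return ret;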

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/6] sched/wait: Add wait_event_state()
  2022-08-22 11:18 ` [PATCH v3 5/6] sched/wait: Add wait_event_state() Peter Zijlstra
@ 2022-09-04  9:54   ` Ingo Molnar
  2022-09-06 11:08     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2022-09-04  9:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> Allows waiting with a custom @state.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/wait.h |   28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -931,6 +931,34 @@ extern int do_wait_intr_irq(wait_queue_h
>  	__ret;									\
>  })
>  
> +#define __wait_event_state(wq, condition, state)				\
> +	___wait_event(wq, condition, state, 0, 0, schedule())
> +
> +/**
> + * wait_event_state - sleep until a condition gets true
> + * @wq_head: the waitqueue to wait on
> + * @condition: a C expression for the event to wait for
> + * @state: state to sleep in
> + *
> + * The process is put to sleep (@state) until the @condition evaluates to true
> + * or a signal is received.  The @condition is checked each time the waitqueue
> + * @wq_head is woken up.

Documentation inconsistency nit: if TASK_INTERRUPTIBLE isn't in @state then 
we won't wake up when a signal is received. This probably got copy-pasted 
from a signal variant.

> + *
> + * wake_up() has to be called after changing any variable that could
> + * change the result of the wait condition.
> + *
> + * The function will return -ERESTARTSYS if it was interrupted by a
> + * signal and 0 if @condition evaluated to true.

That's not unconditionally true either if !TASK_INTERRUPTIBLE.
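
Concretely (hypothetical callers; @wq and @done assumed):

	/* No TASK_INTERRUPTIBLE or TASK_WAKEKILL in @state: signals are
	 * ignored, so -ERESTARTSYS cannot happen and ret is always 0.
	 */
	ret = wait_event_state(wq, done, TASK_UNINTERRUPTIBLE | TASK_FREEZABLE);

	/* TASK_INTERRUPTIBLE in @state: a signal interrupts the wait and
	 * ret may be -ERESTARTSYS.
	 */
	ret = wait_event_state(wq, done, TASK_INTERRUPTIBLE | TASK_FREEZABLE);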

> +#define wait_event_state(wq_head, condition, state)				\
> +({										\
> +	int __ret = 0;								\
> +	might_sleep();								\

Very small style consistency nit, the above should have a newline after 
local variables:

> +#define wait_event_state(wq_head, condition, state)				\
> +({										\
> +	int __ret = 0;								\
> +                                                                             \
> +	might_sleep();								\

Like most (but not all ... :-/ ) of the existing primitives have.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  2022-08-23 17:36   ` Rafael J. Wysocki
@ 2022-09-04 10:09   ` Ingo Molnar
  2022-09-06 11:23     ` Peter Zijlstra
  2022-09-23  7:21   ` Christian Borntraeger
  2022-10-21 17:22   ` Ville Syrjälä
  3 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2022-09-04 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -81,25 +81,32 @@ struct task_group;
>   */
>  
>  /* Used in tsk->state: */
> -#define TASK_RUNNING			0x0000
> -#define TASK_INTERRUPTIBLE		0x0001
> -#define TASK_UNINTERRUPTIBLE		0x0002
> -#define __TASK_STOPPED			0x0004
> -#define __TASK_TRACED			0x0008
> +#define TASK_RUNNING			0x000000
> +#define TASK_INTERRUPTIBLE		0x000001
> +#define TASK_UNINTERRUPTIBLE		0x000002
> +#define __TASK_STOPPED			0x000004
> +#define __TASK_TRACED			0x000008
>  /* Used in tsk->exit_state: */
> -#define EXIT_DEAD			0x0010
> -#define EXIT_ZOMBIE			0x0020
> +#define EXIT_DEAD			0x000010
> +#define EXIT_ZOMBIE			0x000020
>  #define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
>  /* Used in tsk->state again: */
> -#define TASK_PARKED			0x0040
> -#define TASK_DEAD			0x0080
> -#define TASK_WAKEKILL			0x0100
> -#define TASK_WAKING			0x0200
> -#define TASK_NOLOAD			0x0400
> -#define TASK_NEW			0x0800
> -/* RT specific auxilliary flag to mark RT lock waiters */
> -#define TASK_RTLOCK_WAIT		0x1000
> -#define TASK_STATE_MAX			0x2000
> +#define TASK_PARKED			0x000040
> +#define TASK_DEAD			0x000080
> +#define TASK_WAKEKILL			0x000100
> +#define TASK_WAKING			0x000200
> +#define TASK_NOLOAD			0x000400
> +#define TASK_NEW			0x000800
> +#define TASK_FREEZABLE			0x001000
> +#define __TASK_FREEZABLE_UNSAFE	       (0x002000 * IS_ENABLED(CONFIG_LOCKDEP))
> +#define TASK_FROZEN			0x004000
> +#define TASK_RTLOCK_WAIT		0x008000
> +#define TASK_STATE_MAX			0x010000

Patch ordering suggestion: would be really nice to first do the width 
adjustment as a preparatory patch, then the real changes. The mixing 
obscures what the patch is doing here: namely, that we leave all bits before 
TASK_NEW unchanged and add TASK_FREEZABLE, __TASK_FREEZABLE_UNSAFE & 
TASK_FROZEN in before TASK_RTLOCK_WAIT.

Btw., wouldn't it be better to just add the new bits right before 
TASK_STATE_MAX, and leave the existing ones unchanged? I don't think the 
order of TASK_RTLOCK_WAIT is relevant, right?
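
I.e., a sketch of that alternative, with every pre-existing value left
as-is:

	#define TASK_RTLOCK_WAIT	0x1000
	#define TASK_FREEZABLE		0x2000
	#define __TASK_FREEZABLE_UNSAFE	(0x4000 * IS_ENABLED(CONFIG_LOCKDEP))
	#define TASK_FROZEN		0x8000
	#define TASK_STATE_MAX		0x10000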

>  /* Convenience macros for the sake of set_current_state: */
>  #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
> @@ -1714,7 +1721,6 @@ extern struct pid *cad_pid;
>  #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
>  #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
> -#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
>  #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */

yay.

BTW., we should probably mark/document all PF_ holes with a PF__RESERVED 
kind of scheme? Something simple, like:

   #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
   #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 + #define PF__RESERVED_04000	0x00004000	/* Unused */
   #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 + #define PF__RESERVED_10000	0x00010000	/* Unused */
   #define PF_KSWAPD		0x00020000	/* I am kswapd */
   #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
   #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */

?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state
  2022-08-22 11:18 ` [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state Peter Zijlstra
@ 2022-09-04 10:44   ` Ingo Molnar
  2022-09-06 10:54     ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2022-09-04 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> 
> That is, instead of an equal comparison, use it as a mask. This allows
> matching multiple block conditions.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
>  		 * is actually now running somewhere else!
>  		 */
>  		while (task_running(rq, p)) {
> -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> +			if (match_state && !(READ_ONCE(p->__state) & match_state))
>  				return 0;

We lose the unlikely annotation there - but I guess it probably never 
really mattered anyway?

Suggestion #1:

- Shouldn't we rename task_running() to something like task_on_cpu()? The 
  task_running() primitive is similar to TASK_RUNNING but is not based off 
  any TASK_FLAGS.

Suggestion #2:

- Shouldn't we eventually standardize on task->on_cpu on UP kernels too? 
  They don't really matter anymore, and doing so removes #ifdefs and makes 
  the code easier to read.
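
For #1 the rename would be mechanical (a sketch, body as today's
task_running()); #2 would then let the #ifdef disappear, assuming
->on_cpu is also kept up to date on UP:

	static inline bool task_on_cpu(struct rq *rq, struct task_struct *p)
	{
	#ifdef CONFIG_SMP
		return p->on_cpu;
	#else
		return task_current(rq, p);
	#endif
	}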


>  			cpu_relax();
>  		}
> @@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
>  		running = task_running(rq, p);
>  		queued = task_on_rq_queued(p);
>  		ncsw = 0;
> -		if (!match_state || READ_ONCE(p->__state) == match_state)
> +		if (!match_state || (READ_ONCE(p->__state) & match_state))
>  			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
>  		task_rq_unlock(rq, p, &rf);

Suggestion #3:

- Couldn't the following users with a 0 mask:

    drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
    fs/coredump.c:                  wait_task_inactive(ptr->task, 0);

  Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
  !match_state special case?

  They'd do something like:

    drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
    fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);

  It's not a 100% equivalent transformation, though it looks OK 
  at first sight: ->__state will be some nonzero mask for genuine tasks 
  waiting to schedule out, so any match will be functionally the same as a 
  0 flag telling us not to check any of the bits, right? I might be missing 
  something though.
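
A sketch of the TASK_ANY variant:

	#define TASK_ANY	(TASK_STATE_MAX-1)

with which the !match_state special case folds away:

	while (task_running(rq, p)) {
		if (!(READ_ONCE(p->__state) & match_state))
			return 0;
		cpu_relax();
	}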

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-08-22 11:18 ` [PATCH v3 4/6] sched/completion: Add wait_for_completion_state() Peter Zijlstra
  2022-08-23 17:32   ` Rafael J. Wysocki
@ 2022-09-04 10:46   ` Ingo Molnar
  2022-09-06 10:24     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2022-09-04 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> Allows waiting with a custom @state.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/completion.h |    1 +
>  kernel/sched/completion.c  |    9 +++++++++
>  2 files changed, 10 insertions(+)
> 
> --- a/include/linux/completion.h
> +++ b/include/linux/completion.h
> @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
>  extern void wait_for_completion_io(struct completion *);
>  extern int wait_for_completion_interruptible(struct completion *x);
>  extern int wait_for_completion_killable(struct completion *x);
> +extern int wait_for_completion_state(struct completion *x, unsigned int state);
>  extern unsigned long wait_for_completion_timeout(struct completion *x,
>  						   unsigned long timeout);
>  extern unsigned long wait_for_completion_io_timeout(struct completion *x,
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
>  }
>  EXPORT_SYMBOL(wait_for_completion_killable);
>  
> +int __sched wait_for_completion_state(struct completion *x, unsigned int state)
> +{
> +	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
> +	if (t == -ERESTARTSYS)
> +		return t;
> +	return 0;

Nit: newline missing after local variable definition.

Other than that:

Reviewed-by: Ingo Molnar <mingo@kernel.org>


Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-09-04 10:46   ` Ingo Molnar
@ 2022-09-06 10:24     ` Peter Zijlstra
  2022-09-07  7:35       ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-06 10:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Sun, Sep 04, 2022 at 12:46:11PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Allows waiting with a custom @state.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/completion.h |    1 +
> >  kernel/sched/completion.c  |    9 +++++++++
> >  2 files changed, 10 insertions(+)
> > 
> > --- a/include/linux/completion.h
> > +++ b/include/linux/completion.h
> > @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
> >  extern void wait_for_completion_io(struct completion *);
> >  extern int wait_for_completion_interruptible(struct completion *x);
> >  extern int wait_for_completion_killable(struct completion *x);
> > +extern int wait_for_completion_state(struct completion *x, unsigned int state);
> >  extern unsigned long wait_for_completion_timeout(struct completion *x,
> >  						   unsigned long timeout);
> >  extern unsigned long wait_for_completion_io_timeout(struct completion *x,
> > --- a/kernel/sched/completion.c
> > +++ b/kernel/sched/completion.c
> > @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
> >  }
> >  EXPORT_SYMBOL(wait_for_completion_killable);
> >  
> > +int __sched wait_for_completion_state(struct completion *x, unsigned int state)
> > +{
> > +	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
> > +	if (t == -ERESTARTSYS)
> > +		return t;
> > +	return 0;
> 
> Nit: newline missing after local variable definition.

Yah, I know, but all the other similar functions there have the same
defect. I don't much like whitespace patches, so I figured I'd be
consistent and let it all be for now.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state
  2022-09-04 10:44   ` Ingo Molnar
@ 2022-09-06 10:54     ` Peter Zijlstra
  2022-09-07  7:23       ` Ingo Molnar
                         ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-06 10:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Sun, Sep 04, 2022 at 12:44:36PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> > 
> > That is, instead of an equal comparison, use it as a mask. This allows
> > matching multiple block conditions.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/core.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
> >  		 * is actually now running somewhere else!
> >  		 */
> >  		while (task_running(rq, p)) {
> > -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> > +			if (match_state && !(READ_ONCE(p->__state) & match_state))
> >  				return 0;
> 
> We lose the unlikely annotation there - but I guess it probably never 
> really mattered anyway?

So any wait_task_inactive() caller does want that case to be true, but
the whole match_state precondition mostly wrecks things anyway. If
anything it should've been:

		if (likely(match_state && !(READ_ONCE(p->__state) & match_state)))
			return 0;

but I can't find it in me to care too much here.

> Suggestion #1:
> 
> - Shouldn't we rename task_running() to something like task_on_cpu()? The 
>   task_running() primitive is similar to TASK_RUNNING but is not based off 
>   any TASK_FLAGS.

That looks like a simple enough patch, lemme go do that.

> Suggestion #2:
> 
> - Shouldn't we eventually standardize on task->on_cpu on UP kernels too? 
>   They don't really matter anymore, and doing so removes #ifdefs and makes 
>   the code easier to read.

Probably, but that sounds like something that'll spiral out of control
real quick, so I'll leave that on the TODO list somewhere.

> >  			cpu_relax();
> >  		}
> > @@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
> >  		running = task_running(rq, p);
> >  		queued = task_on_rq_queued(p);
> >  		ncsw = 0;
> > -		if (!match_state || READ_ONCE(p->__state) == match_state)
> > +		if (!match_state || (READ_ONCE(p->__state) & match_state))
> >  			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
> >  		task_rq_unlock(rq, p, &rf);
> 
> Suggestion #3:
> 
> - Couldn't the following users with a 0 mask:
> 
>     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
>     fs/coredump.c:                  wait_task_inactive(ptr->task, 0);
> 
>   Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
>   !match_state special case?
> 
>   They'd do something like:
> 
>     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
>     fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);
> 
>   It's not an entirely 100% equivalent transformation though, but looks OK 
>   at first sight: ->__state will be some nonzero mask for genuine tasks 
>   waiting to schedule out, so any match will be functionally the same as a 
>   0 flag telling us not to check any of the bits, right? I might be missing 
>   something though.

I too am thinking that should work. Added patch for that.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/6] sched/wait: Add wait_event_state()
  2022-09-04  9:54   ` Ingo Molnar
@ 2022-09-06 11:08     ` Peter Zijlstra
  2022-09-07  7:26       ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-06 11:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Sun, Sep 04, 2022 at 11:54:57AM +0200, Ingo Molnar wrote:
> > +/**
> > + * wait_event_state - sleep until a condition gets true
> > + * @wq_head: the waitqueue to wait on
> > + * @condition: a C expression for the event to wait for
> > + * @state: state to sleep in
> > + *
> > + * The process is put to sleep (@state) until the @condition evaluates to true
> > + * or a signal is received.  The @condition is checked each time the waitqueue
> > + * @wq_head is woken up.
> 
> Documentation inconsistency nit: if TASK_INTERRUPTIBLE isn't in @state then 
> we won't wake up when a signal is received. This probably got copy-pasted 
> from a signal variant.
> 
> > + *
> > + * wake_up() has to be called after changing any variable that could
> > + * change the result of the wait condition.
> > + *
> > + * The function will return -ERESTARTSYS if it was interrupted by a
> > + * signal and 0 if @condition evaluated to true.
> 
> That's not unconditionally true either if !TASK_INTERRUPTIBLE.


--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -942,14 +942,14 @@ extern int do_wait_intr_irq(wait_queue_h
  * @state: state to sleep in
  *
  * The process is put to sleep (@state) until the @condition evaluates to true
- * or a signal is received.  The @condition is checked each time the waitqueue
- * @wq_head is woken up.
+ * or a signal is received (when allowed by @state).  The @condition is checked
+ * each time the waitqueue @wq_head is woken up.
  *
  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.
  *
- * The function will return -ERESTARTSYS if it was interrupted by a
- * signal and 0 if @condition evaluated to true.
+ * The function will return -ERESTARTSYS if it was interrupted by a signal
+ * (when allowed by @state) and 0 if @condition evaluated to true.
  */
 #define wait_event_state(wq_head, condition, state)				\
 ({										\


> > +#define wait_event_state(wq_head, condition, state)				\
> > +({										\
> > +	int __ret = 0;								\
> > +	might_sleep();								\
> 
> Very small style consistency nit, the above should have a newline after 
> local variables:
> 
> > +#define wait_event_state(wq_head, condition, state)				\
> > +({										\
> > +	int __ret = 0;								\
> > +                                                                             \
> > +	might_sleep();								\
> 
> Like most (but not all ... :-/ ) of the existing primitives have.

Yeah, I'm going to leave it as is.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-04 10:09   ` Ingo Molnar
@ 2022-09-06 11:23     ` Peter Zijlstra
  2022-09-07  7:30       ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-06 11:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Sun, Sep 04, 2022 at 12:09:37PM +0200, Ingo Molnar wrote:

> BTW., we should probably mark/document all PF_ holes with a PF__RESERVED 
> kind of scheme? Something simple, like:
> 
>    #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
>    #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
>  + #define PF__RESERVED_04000	0x00004000	/* Unused */
>    #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
>  + #define PF__RESERVED_10000	0x00010000	/* Unused */
>    #define PF_KSWAPD		0x00020000	/* I am kswapd */
>    #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
>    #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */


How's this then? It immediately shows how holey it is :-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1722,7 +1722,9 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC		0x00000800	/* Allocating memory */
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
+#define PF__HOLE__00004000	0x00004000	/* A HOLE */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
+#define PF__HOLE__00010000	0x00010000	/* A HOLE */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1730,9 +1732,14 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
+#define PF__HOLE__00800000	0x00800000	/* A HOLE */
+#define PF__HOLE__01000000	0x01000000	/* A HOLE */
+#define PF__HOLE__02000000	0x02000000	/* A HOLE */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
+#define PF__HOLE__20000000	0x20000000	/* A HOLE */
+#define PF__HOLE__40000000	0x40000000	/* A HOLE */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
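
An illustrative upside of naming the holes: a later patch claiming a bit
becomes a self-documenting one-liner (hypothetical diff, PF_SOMETHING_NEW
is a made-up name):

-#define PF__HOLE__00004000	0x00004000	/* A HOLE */
+#define PF_SOMETHING_NEW	0x00004000	/* my new flag */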

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state
  2022-09-06 10:54     ` Peter Zijlstra
@ 2022-09-07  7:23       ` Ingo Molnar
  2022-09-07  9:29       ` Peter Zijlstra
  2022-09-07  9:30       ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2022-09-07  7:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 04, 2022 at 12:44:36PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Make wait_task_inactive()'s @match_state work like ttwu()'s @state.
> > > 
> > > That is, instead of an equal comparison, use it as a mask. This allows
> > > matching multiple block conditions.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/sched/core.c |    4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
> > >  		 * is actually now running somewhere else!
> > >  		 */
> > >  		while (task_running(rq, p)) {
> > > -			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
> > > +			if (match_state && !(READ_ONCE(p->__state) & match_state))
> > >  				return 0;
> > 
> > We lose the unlikely annotation there - but I guess it probably never 
> > really mattered anyway?
> 
> So any wait_task_inactive() caller does want that case to be true, but
> the whole match_state precondition mostly wrecks things anyway. If
> anything it should've been:
> 
> 		if (likely(match_state && !(READ_ONCE(p->__state) & match_state)))
> 			return 0;
> 
> but I can't find it in me to care too much here.

Yeah, I agree that this is probably the most likely branch - and default 
compiler code generation behavior should be pretty close to that to begin 
with.

Ie. ack on dropping the unlikely() annotation. :-)

Might make sense to add a sentence to the changelog though, in case anyone 
(like me) is wondering about whether the dropped annotation was intended.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 5/6] sched/wait: Add wait_event_state()
  2022-09-06 11:08     ` Peter Zijlstra
@ 2022-09-07  7:26       ` Ingo Molnar
  0 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2022-09-07  7:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 04, 2022 at 11:54:57AM +0200, Ingo Molnar wrote:
> > > +/**
> > > + * wait_event_state - sleep until a condition gets true
> > > + * @wq_head: the waitqueue to wait on
> > > + * @condition: a C expression for the event to wait for
> > > + * @state: state to sleep in
> > > + *
> > > + * The process is put to sleep (@state) until the @condition evaluates to true
> > > + * or a signal is received.  The @condition is checked each time the waitqueue
> > > + * @wq_head is woken up.
> > 
> > Documentation inconsistency nit: if TASK_INTERRUPTIBLE isn't in @state then 
> > we won't wake up when a signal is received. This probably got copy-pasted 
> > from a signal variant.
> > 
> > > + *
> > > + * wake_up() has to be called after changing any variable that could
> > > + * change the result of the wait condition.
> > > + *
> > > + * The function will return -ERESTARTSYS if it was interrupted by a
> > > + * signal and 0 if @condition evaluated to true.
> > 
> > That's not unconditionally true either if !TASK_INTERRUPTIBLE.
> 
> 
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -942,14 +942,14 @@ extern int do_wait_intr_irq(wait_queue_h
>   * @state: state to sleep in
>   *
>   * The process is put to sleep (@state) until the @condition evaluates to true
> - * or a signal is received.  The @condition is checked each time the waitqueue
> - * @wq_head is woken up.
> + * or a signal is received (when allowed by @state).  The @condition is checked
> + * each time the waitqueue @wq_head is woken up.
>   *
>   * wake_up() has to be called after changing any variable that could
>   * change the result of the wait condition.
>   *
> - * The function will return -ERESTARTSYS if it was interrupted by a
> - * signal and 0 if @condition evaluated to true.
> + * The function will return -ERESTARTSYS if it was interrupted by a signal
> + * (when allowed by @state) and 0 if @condition evaluated to true.
>   */

Reviewed-by: Ingo Molnar <mingo@kernel.org>

> > > +#define wait_event_state(wq_head, condition, state)				\
> > > +({										\
> > > +	int __ret = 0;								\
> > > +                                                                             \
> > > +	might_sleep();								\
> > 
> > Like most (but not all ... :-/ ) of the existing primitives have.
> 
> Yeah, I'm going to leave it as is.

Will queue up a cleanup patch, should I ever notice this detail again ... :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-06 11:23     ` Peter Zijlstra
@ 2022-09-07  7:30       ` Ingo Molnar
  0 siblings, 0 replies; 59+ messages in thread
From: Ingo Molnar @ 2022-09-07  7:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 04, 2022 at 12:09:37PM +0200, Ingo Molnar wrote:
> 
> > BTW., we should probably mark/document all PF_ holes with a PF__RESERVED 
> > kind of scheme? Something simple, like:
> > 
> >    #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
> >    #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
> >  + #define PF__RESERVED_04000	0x00004000	/* Unused */
> >    #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
> >  + #define PF__RESERVED_10000	0x00010000	/* Unused */
> >    #define PF_KSWAPD		0x00020000	/* I am kswapd */
> >    #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
> >    #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
> 
> 
> How's this then, it immediately shows how holey it is :-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1722,7 +1722,9 @@ extern struct pid *cad_pid;
>  #define PF_MEMALLOC		0x00000800	/* Allocating memory */
>  #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
>  #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
> +#define PF__HOLE__00004000	0x00004000	/* A HOLE */
>  #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
> +#define PF__HOLE__00010000	0x00010000	/* A HOLE */
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
>  #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
> @@ -1730,9 +1732,14 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> +#define PF__HOLE__00800000	0x00800000	/* A HOLE */
> +#define PF__HOLE__01000000	0x01000000	/* A HOLE */
> +#define PF__HOLE__02000000	0x02000000	/* A HOLE */
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
>  #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
> +#define PF__HOLE__20000000	0x20000000	/* A HOLE */
> +#define PF__HOLE__40000000	0x40000000	/* A HOLE */
>  #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */

LGTM - OTOH this looks quite a bit more cluttery than I imagined it in my 
head. :-/ So I'd leave out the comment part at minimum. With that:

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-09-06 10:24     ` Peter Zijlstra
@ 2022-09-07  7:35       ` Ingo Molnar
  2022-09-07  9:24         ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2022-09-07  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Sep 04, 2022 at 12:46:11PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Allows waiting with a custom @state.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  include/linux/completion.h |    1 +
> > >  kernel/sched/completion.c  |    9 +++++++++
> > >  2 files changed, 10 insertions(+)
> > > 
> > > --- a/include/linux/completion.h
> > > +++ b/include/linux/completion.h
> > > @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
> > >  extern void wait_for_completion_io(struct completion *);
> > >  extern int wait_for_completion_interruptible(struct completion *x);
> > >  extern int wait_for_completion_killable(struct completion *x);
> > > +extern int wait_for_completion_state(struct completion *x, unsigned int state);
> > >  extern unsigned long wait_for_completion_timeout(struct completion *x,
> > >  						   unsigned long timeout);
> > >  extern unsigned long wait_for_completion_io_timeout(struct completion *x,
> > > --- a/kernel/sched/completion.c
> > > +++ b/kernel/sched/completion.c
> > > @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
> > >  }
> > >  EXPORT_SYMBOL(wait_for_completion_killable);
> > >  
> > > +int __sched wait_for_completion_state(struct completion *x, unsigned int state)
> > > +{
> > > +	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
> > > +	if (t == -ERESTARTSYS)
> > > +		return t;
> > > +	return 0;
> > 
> > Nit: newline missing after local variable definition.
> 
> Yah, I know, but all the other similar functions there have the same
> defect. I don't much like whitespace patches, so I figured I'd be
> consistent and let it all be for now.

That's not actually true: there's ~7 functions in kernel/sched/completion.c 
with local variables, and only ~2 have this minor stylistic inconsistency 
right now AFAICS. Scheduler-wide the ratio is even lower.

So even if a patch doesn't entirely remove the residual noise, let's not 
add to it, please?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread
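
As a sketch of how the new wait_for_completion_state() composes with the
freezer bits, a hypothetical caller (done would be a struct completion
initialized elsewhere; the state combination is illustrative):

	int err = wait_for_completion_state(&done,
					    TASK_KILLABLE | TASK_FREEZABLE);
	if (err)	/* -ERESTARTSYS: a fatal signal arrived first */
		return err;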

* Re: [PATCH v3 4/6] sched/completion: Add wait_for_completion_state()
  2022-09-07  7:35       ` Ingo Molnar
@ 2022-09-07  9:24         ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-07  9:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Wed, Sep 07, 2022 at 09:35:20AM +0200, Ingo Molnar wrote:
> That's not actually true: there's ~7 functions in kernel/sched/completion.c 
> with local variables, and only ~2 have this minor stylistic inconsistency 
> right now AFAICS. Scheduler-wide the ratio is even lower.
> 
> So even if a patch doesn't entirely remove the residual noise, let's not 
> add to it, please?


--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -204,6 +204,7 @@ EXPORT_SYMBOL(wait_for_completion_io_tim
 int __sched wait_for_completion_interruptible(struct completion *x)
 {
 	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_INTERRUPTIBLE);
+
 	if (t == -ERESTARTSYS)
 		return t;
 	return 0;
@@ -241,6 +242,7 @@ EXPORT_SYMBOL(wait_for_completion_interr
 int __sched wait_for_completion_killable(struct completion *x)
 {
 	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_KILLABLE);
+
 	if (t == -ERESTARTSYS)
 		return t;
 	return 0;
@@ -250,6 +252,7 @@ EXPORT_SYMBOL(wait_for_completion_killab
 int __sched wait_for_completion_state(struct completion *x, unsigned int state)
 {
 	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
+
 	if (t == -ERESTARTSYS)
 		return t;
 	return 0;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state
  2022-09-06 10:54     ` Peter Zijlstra
  2022-09-07  7:23       ` Ingo Molnar
@ 2022-09-07  9:29       ` Peter Zijlstra
  2022-09-07  9:30       ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-07  9:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Tue, Sep 06, 2022 at 12:54:34PM +0200, Peter Zijlstra wrote:

> > Suggestion #1:
> > 
> > - Shouldn't we rename task_running() to something like task_on_cpu()? The 
> >   task_running() primitive is similar to TASK_RUNNING but is not based off 
> >   any TASK_FLAGS.
> 
> That looks like a simple enough patch, lemme go do that.

---
Subject: sched: Rename task_running() to task_on_cpu()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:33:04 CEST 2022

There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing to
task_on_cpu().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c       |   10 +++++-----
 kernel/sched/core_sched.c |    2 +-
 kernel/sched/deadline.c   |    6 +++---
 kernel/sched/fair.c       |    2 +-
 kernel/sched/rt.c         |    6 +++---
 kernel/sched/sched.h      |    2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2778,7 +2778,7 @@ static int affine_move_task(struct rq *r
 		return -EINVAL;
 	}
 
-	if (task_running(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
+	if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
 		/*
 		 * MIGRATE_ENABLE gets here because 'p == current', but for
 		 * anything else we cannot do is_migration_disabled(), punt
@@ -3290,11 +3290,11 @@ unsigned long wait_task_inactive(struct
 		 *
 		 * NOTE! Since we don't hold any locks, it's not
 		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_running()" will
+		 * But we don't care, since "task_on_cpu()" will
 		 * return false if the runqueue has changed and p
 		 * is actually now running somewhere else!
 		 */
-		while (task_running(rq, p)) {
+		while (task_on_cpu(rq, p)) {
 			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
 				return 0;
 			cpu_relax();
@@ -3307,7 +3307,7 @@ unsigned long wait_task_inactive(struct
 		 */
 		rq = task_rq_lock(p, &rf);
 		trace_sched_wait_task(p);
-		running = task_running(rq, p);
+		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
 		if (!match_state || READ_ONCE(p->__state) == match_state)
@@ -8649,7 +8649,7 @@ int __sched yield_to(struct task_struct
 	if (curr->sched_class != p->sched_class)
 		goto out_unlock;
 
-	if (task_running(p_rq, p) || !task_is_running(p))
+	if (task_on_cpu(p_rq, p) || !task_is_running(p))
 		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p);
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -88,7 +88,7 @@ static unsigned long sched_core_update_c
 	 * core has now entered/left forced idle state. Defer accounting to the
 	 * next scheduling edge, rather than always forcing a reschedule here.
 	 */
-	if (task_running(rq, p))
+	if (task_on_cpu(rq, p))
 		resched_curr(rq);
 
 	task_rq_unlock(rq, p, &rf);
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2087,7 +2087,7 @@ static void task_fork_dl(struct task_str
 
 static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    cpumask_test_cpu(cpu, &p->cpus_mask))
 		return 1;
 	return 0;
@@ -2241,7 +2241,7 @@ static struct rq *find_lock_later_rq(str
 		if (double_lock_balance(rq, later_rq)) {
 			if (unlikely(task_rq(task) != rq ||
 				     !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
-				     task_running(rq, task) ||
+				     task_on_cpu(rq, task) ||
 				     !dl_task(task) ||
 				     !task_on_rq_queued(task))) {
 				double_unlock_balance(rq, later_rq);
@@ -2475,7 +2475,7 @@ static void pull_dl_task(struct rq *this
  */
 static void task_woken_dl(struct rq *rq, struct task_struct *p)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
 	    dl_task(rq->curr) &&
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7938,7 +7938,7 @@ int can_migrate_task(struct task_struct
 	/* Record that we found at least one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-	if (task_running(env->src_rq, p)) {
+	if (task_on_cpu(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1849,7 +1849,7 @@ static void put_prev_task_rt(struct rq *
 
 static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_running(rq, p) &&
+	if (!task_on_cpu(rq, p) &&
 	    cpumask_test_cpu(cpu, &p->cpus_mask))
 		return 1;
 
@@ -2004,7 +2004,7 @@ static struct rq *find_lock_lowest_rq(st
 			 */
 			if (unlikely(task_rq(task) != rq ||
 				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
-				     task_running(rq, task) ||
+				     task_on_cpu(rq, task) ||
 				     !rt_task(task) ||
 				     !task_on_rq_queued(task))) {
 
@@ -2462,7 +2462,7 @@ static void pull_rt_task(struct rq *this
  */
 static void task_woken_rt(struct rq *rq, struct task_struct *p)
 {
-	bool need_to_push = !task_running(rq, p) &&
+	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
 			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2060,7 +2060,7 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-static inline int task_running(struct rq *rq, struct task_struct *p)
+static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
 	return p->on_cpu;
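
The distinction the rename makes explicit, side by side (task_is_running()
already exists in include/linux/sched.h):

	task_on_cpu(rq, p);	/* p->on_cpu: p is executing on a CPU right now */
	task_is_running(p);	/* p->__state == TASK_RUNNING: p is runnable,
				 * possibly queued and waiting for a CPU */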

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state
  2022-09-06 10:54     ` Peter Zijlstra
  2022-09-07  7:23       ` Ingo Molnar
  2022-09-07  9:29       ` Peter Zijlstra
@ 2022-09-07  9:30       ` Peter Zijlstra
  2 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-07  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: rjw, oleg, vincent.guittot, dietmar.eggemann, rostedt, mgorman,
	ebiederm, bigeasy, Will Deacon, linux-kernel, tj, linux-pm

On Tue, Sep 06, 2022 at 12:54:34PM +0200, Peter Zijlstra wrote:

> > Suggestion #3:
> > 
> > - Couldn't the following users with a 0 mask:
> > 
> >     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, 0);
> >     fs/coredump.c:                  wait_task_inactive(ptr->task, 0);
> > 
> >   Use ~0 instead (exposed as TASK_ANY or so) and then we can drop the
> >   !match_state special case?
> > 
> >   They'd do something like:
> > 
> >     drivers/powercap/idle_inject.c:         wait_task_inactive(iit->tsk, TASK_ANY);
> >     fs/coredump.c:                  wait_task_inactive(ptr->task, TASK_ANY);
> > 
> >   It's not an entirely 100% equivalent transformation though, but looks OK 
> >   at first sight: ->__state will be some nonzero mask for genuine tasks 
> >   waiting to schedule out, so any match will be functionally the same as a 
> >   0 flag telling us not to check any of the bits, right? I might be missing 
> >   something though.
> 
> I too am thinking that should work. Added patch for that.

---
Subject: sched: Add TASK_ANY for wait_task_inactive()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 6 12:39:55 CEST 2022

Now that wait_task_inactive()'s @match_state argument is a mask (like
ttwu()) it is possible to replace the special !match_state case with
an 'all-states' value such that any blocked state will match.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/powercap/idle_inject.c |    2 +-
 fs/coredump.c                  |    2 +-
 include/linux/sched.h          |    2 ++
 kernel/sched/core.c            |   16 ++++++++--------
 4 files changed, 12 insertions(+), 10 deletions(-)

--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -254,7 +254,7 @@ void idle_inject_stop(struct idle_inject
 		iit = per_cpu_ptr(&idle_inject_thread, cpu);
 		iit->should_run = 0;
 
-		wait_task_inactive(iit->tsk, 0);
+		wait_task_inactive(iit->tsk, TASK_ANY);
 	}
 
 	cpu_hotplug_enable();
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -412,7 +412,7 @@ static int coredump_wait(int exit_code,
 		 */
 		ptr = core_state->dumper.next;
 		while (ptr != NULL) {
-			wait_task_inactive(ptr->task, 0);
+			wait_task_inactive(ptr->task, TASK_ANY);
 			ptr = ptr->next;
 		}
 	}
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -101,6 +101,8 @@ struct task_group;
 #define TASK_RTLOCK_WAIT		0x1000
 #define TASK_STATE_MAX			0x2000
 
+#define TASK_ANY			(TASK_STATE_MAX-1)
+
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
 #define TASK_STOPPED			(TASK_WAKEKILL | __TASK_STOPPED)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3254,12 +3254,12 @@ int migrate_swap(struct task_struct *cur
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
- * If @match_state is nonzero, it's the @p->state value just checked and
- * not expected to change.  If it changes, i.e. @p might have woken up,
- * then return zero.  When we succeed in waiting for @p to be off its CPU,
- * we return a positive number (its total switch count).  If a second call
- * a short while later returns the same number, the caller can be sure that
- * @p has remained unscheduled the whole time.
+ * Wait for the thread to block in any of the states set in @match_state.
+ * If it changes, i.e. @p might have woken up, then return zero.  When we
+ * succeed in waiting for @p to be off its CPU, we return a positive number
+ * (its total switch count).  If a second call a short while later returns the
+ * same number, the caller can be sure that @p has remained unscheduled the
+ * whole time.
  *
  * The caller must ensure that the task *will* unschedule sometime soon,
  * else this function might spin for a *long* time. This function can't
@@ -3295,7 +3295,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (match_state && !(READ_ONCE(p->__state) & match_state))
+			if (!(READ_ONCE(p->__state) & match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3310,7 +3310,7 @@ unsigned long wait_task_inactive(struct
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || (READ_ONCE(p->__state) & match_state))
+		if (READ_ONCE(p->__state) & match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
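
Why an all-bits mask reproduces the old !match_state behaviour, using the
values visible in the hunk above (TASK_STATE_MAX == 0x2000, so TASK_ANY ==
0x1fff; the definition tracks TASK_STATE_MAX should later patches add bits):

	/* A blocked task always has a nonzero __state, hence: */
	READ_ONCE(p->__state) & TASK_ANY	/* nonzero for any blocked state */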
 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  2022-08-23 17:36   ` Rafael J. Wysocki
  2022-09-04 10:09   ` Ingo Molnar
@ 2022-09-23  7:21   ` Christian Borntraeger
  2022-09-23  7:53     ` Christian Borntraeger
  2022-10-21 17:22   ` Ville Syrjälä
  3 siblings, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-23  7:21 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	borntraeger, Marc Hartmayer

Peter, 

as a heads-up. This commit (bisected and verified) triggers a
regression in our KVM on s390x CI. The symptom is that a specific
testcase (start a guest with next kernel and a poky ramdisk,
then ssh via vsock into the guest and run the reboot command) now
takes much longer (300 instead of 20 seconds). From a first look
it seems that the sshd takes very long to end during shutdown
but I have not looked into that yet.
Any quick idea?

Christian 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-23  7:21   ` Christian Borntraeger
@ 2022-09-23  7:53     ` Christian Borntraeger
  2022-09-26  8:06       ` Christian Borntraeger
  0 siblings, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-23  7:53 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer

On 23.09.22 at 09:21, Christian Borntraeger wrote:
> Peter,
> 
> as a heads-up. This commit (bisected and verified) triggers a
> regression in our KVM on s390x CI. The symptom is that a specific
> testcase (start a guest with next kernel and a poky ramdisk,
> then ssh via vsock into the guest and run the reboot command) now
> takes much longer (300 instead of 20 seconds). From a first look
> it seems that the sshd takes very long to end during shutdown
> but I have not looked into that yet.
> Any quick idea?
> 
> Christian

the sshd seems to hang in virtio-serial (not vsock).

PID: 237      TASK: 81d1a100          CPU: 1    COMMAND: "sshd"
  LOWCORE INFO:
   -psw      : 0x0404e00180000000 0x0000000131ceb136
   -function : __list_add_valid at 131ceb136
   -prefix   : 0x00410000
   -cpu timer: 0x7fffffd3ec4f33d4
   -clock cmp: 0x2639f08006283e00
   -general registers:
      0x00000008dcea2dce 0x00000001387d44b8
      0x0000000081d1a228 0x00000001387d44b8
      0x00000001387d44b8 0x00000001387d44b8
      0x00000001387d3800 0x00000001387d3700
      0x0000000081d1a100 0x00000001387d44b8
      0x00000001387d44b8 0x0000000081d1a228
      0x0000000081d1a100 0x0000000081d1a100
      0x0000000131608b32 0x00000380004b7aa8
   -access registers:
      0x000003ff 0x8fff5870 0000000000 0000000000
      0000000000 0000000000 0000000000 0000000000
      0000000000 0000000000 0000000000 0000000000
      0000000000 0000000000 0000000000 0000000000
   -control registers:
      0x00a0000014966a10 0x0000000133348007
      0x00000000028c6140 000000000000000000
      0x000000000000ffff 0x00000000028c6140
      0x0000000033000000 0x0000000081f001c7
      0x0000000000008000 000000000000000000
      000000000000000000 000000000000000000
      000000000000000000 0x0000000133348007
      0x00000000db000000 0x00000000028c6000
   -floating point registers:
      0x000003ffb82a9761 0x0000006400000000
      0x000003ffb82a345c 000000000000000000
      0x0000000000007fff 0x000003ffe22fe000
      000000000000000000 0x000003ffe22fa51c
      0x000003ffb81889c0 000000000000000000
      0x000002aa3ce2b470 000000000000000000
      000000000000000000 000000000000000000
      000000000000000000 000000000000000000

  #0 [380004b7b00] pick_next_task at 1315f2088
  #1 [380004b7b98] __schedule at 13215e954
  #2 [380004b7c08] schedule at 13215eeea
  #3 [380004b7c38] wait_port_writable at 3ff80149b2e [virtio_console]
  #4 [380004b7cc0] port_fops_write at 3ff8014a282 [virtio_console]
  #5 [380004b7d40] vfs_write at 131889e3c
  #6 [380004b7e00] ksys_write at 13188a2e8
  #7 [380004b7e50] __do_syscall at 13215761c
  #8 [380004b7e98] system_call at 132166332
  PSW:  0705000180000000 000003ff8f8f3a2a (user space)
  GPRS: 0000000000000015 000003ff00000000 ffffffffffffffda 000002aa02fc68a0
        0000000000000015 0000000000000015 0000000000000000 000002aa02fc68a0
        0000000000000004 000003ff8f8f3a08 0000000000000015 0000000000000000
        000003ff8ffa9f58 0000000000000000 000002aa02365b20 000003ffdf4798d0

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-23  7:53     ` Christian Borntraeger
@ 2022-09-26  8:06       ` Christian Borntraeger
  2022-09-26 10:55         ` Christian Borntraeger
  0 siblings, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26  8:06 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer



On 23.09.22 at 09:53, Christian Borntraeger wrote:
> On 23.09.22 at 09:21, Christian Borntraeger wrote:
>> Peter,
>>
>> as a heads-up. This commit (bisected and verified) triggers a
>> regression in our KVM on s390x CI. The symptom is that a specific
>> testcase (start a guest with next kernel and a poky ramdisk,
>> then ssh via vsock into the guest and run the reboot command) now
>> takes much longer (300 instead of 20 seconds). From a first look
>> it seems that the sshd takes very long to end during shutdown
>> but I have not looked into that yet.
>> Any quick idea?
>>
>> Christian
> 
> the sshd seems to hang in virtio-serial (not vsock).

FWIW, sshd does not seem to hang, instead it seems to busy loop in
wait_port_writable calling into the scheduler over and over again.


> 
> PID: 237      TASK: 81d1a100          CPU: 1    COMMAND: "sshd"
>   LOWCORE INFO:
>    -psw      : 0x0404e00180000000 0x0000000131ceb136
>    -function : __list_add_valid at 131ceb136
>    -prefix   : 0x00410000
>    -cpu timer: 0x7fffffd3ec4f33d4
>    -clock cmp: 0x2639f08006283e00
>    -general registers:
>       0x00000008dcea2dce 0x00000001387d44b8
>       0x0000000081d1a228 0x00000001387d44b8
>       0x00000001387d44b8 0x00000001387d44b8
>       0x00000001387d3800 0x00000001387d3700
>       0x0000000081d1a100 0x00000001387d44b8
>       0x00000001387d44b8 0x0000000081d1a228
>       0x0000000081d1a100 0x0000000081d1a100
>       0x0000000131608b32 0x00000380004b7aa8
>    -access registers:
>       0x000003ff 0x8fff5870 0000000000 0000000000
>       0000000000 0000000000 0000000000 0000000000
>       0000000000 0000000000 0000000000 0000000000
>       0000000000 0000000000 0000000000 0000000000
>    -control registers:
>       0x00a0000014966a10 0x0000000133348007
>       0x00000000028c6140 000000000000000000
>       0x000000000000ffff 0x00000000028c6140
>       0x0000000033000000 0x0000000081f001c7
>       0x0000000000008000 000000000000000000
>       000000000000000000 000000000000000000
>       000000000000000000 0x0000000133348007
>       0x00000000db000000 0x00000000028c6000
>    -floating point registers:
>       0x000003ffb82a9761 0x0000006400000000
>       0x000003ffb82a345c 000000000000000000
>       0x0000000000007fff 0x000003ffe22fe000
>       000000000000000000 0x000003ffe22fa51c
>       0x000003ffb81889c0 000000000000000000
>       0x000002aa3ce2b470 000000000000000000
>       000000000000000000 000000000000000000
>       000000000000000000 000000000000000000
> 
>   #0 [380004b7b00] pick_next_task at 1315f2088
>   #1 [380004b7b98] __schedule at 13215e954
>   #2 [380004b7c08] schedule at 13215eeea
>   #3 [380004b7c38] wait_port_writable at 3ff80149b2e [virtio_console]
>   #4 [380004b7cc0] port_fops_write at 3ff8014a282 [virtio_console]
>   #5 [380004b7d40] vfs_write at 131889e3c
>   #6 [380004b7e00] ksys_write at 13188a2e8
>   #7 [380004b7e50] __do_syscall at 13215761c
>   #8 [380004b7e98] system_call at 132166332
>   PSW:  0705000180000000 000003ff8f8f3a2a (user space)
>   GPRS: 0000000000000015 000003ff00000000 ffffffffffffffda 000002aa02fc68a0
>         0000000000000015 0000000000000015 0000000000000000 000002aa02fc68a0
>         0000000000000004 000003ff8f8f3a08 0000000000000015 0000000000000000
>         000003ff8ffa9f58 0000000000000000 000002aa02365b20 000003ffdf4798d0

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26  8:06       ` Christian Borntraeger
@ 2022-09-26 10:55         ` Christian Borntraeger
  2022-09-26 12:13           ` Peter Zijlstra
  2022-09-26 12:32           ` Christian Borntraeger
  0 siblings, 2 replies; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26 10:55 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer



On 26.09.22 at 10:06, Christian Borntraeger wrote:
> 
> 
> On 23.09.22 at 09:53, Christian Borntraeger wrote:
>> On 23.09.22 at 09:21, Christian Borntraeger wrote:
>>> Peter,
>>>
>>> as a heads-up. This commit (bisected and verified) triggers a
>>> regression in our KVM on s390x CI. The symptom is that a specific
>>> testcase (start a guest with next kernel and a poky ramdisk,
>>> then ssh via vsock into the guest and run the reboot command) now
>>> takes much longer (300 instead of 20 seconds). From a first look
>>> it seems that the sshd takes very long to end during shutdown
>>> but I have not looked into that yet.
>>> Any quick idea?
>>>
>>> Christian
>>
>> the sshd seems to hang in virtio-serial (not vsock).
> 
> FWIW, sshd does not seem to hang, instead it seems to busy loop in
> wait_port_writable calling into the scheduler over and over again.

-#define TASK_FREEZABLE                 0x00002000
+#define TASK_FREEZABLE                 0x00000000

"Fixes" the issue. Just have to find out which of users is responsible.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 10:55         ` Christian Borntraeger
@ 2022-09-26 12:13           ` Peter Zijlstra
  2022-09-26 12:32           ` Christian Borntraeger
  1 sibling, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-26 12:13 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, amit, virtualization

On Mon, Sep 26, 2022 at 12:55:21PM +0200, Christian Borntraeger wrote:
> 
> 
> On 26.09.22 at 10:06, Christian Borntraeger wrote:
> > 
> > 
> > On 23.09.22 at 09:53, Christian Borntraeger wrote:
> > > On 23.09.22 at 09:21, Christian Borntraeger wrote:
> > > > Peter,
> > > > 
> > > > as a heads-up. This commit (bisected and verified) triggers a
> > > > regression in our KVM on s390x CI. The symptom is that a specific
> > > > testcase (start a guest with next kernel and a poky ramdisk,
> > > > then ssh via vsock into the guest and run the reboot command) now
> > > > takes much longer (300 instead of 20 seconds). From a first look
> > > > it seems that the sshd takes very long to end during shutdown
> > > > but I have not looked into that yet.
> > > > Any quick idea?
> > > > 
> > > > Christian
> > > 
> > > the sshd seems to hang in virtio-serial (not vsock).
> > 
> > FWIW, sshd does not seem to hang, instead it seems to busy loop in
> > wait_port_writable calling into the scheduler over and over again.
> 
> -#define TASK_FREEZABLE                 0x00002000
> +#define TASK_FREEZABLE                 0x00000000
> 
> "Fixes" the issue. Just have to find out which of users is responsible.

Since it's not the wait_port_writable() one -- we already tested that by
virtue of 's/wait_event_freezable/wait_event/' there, it must be on the
producing side of that port. But I'm having a wee bit of trouble
following that code.

Is there a task stuck in FROZEN state? -- then again, I thought you said
there was no actual suspend involved, so that should not be it either.

I'm curious though -- how far does it get into the scheduler? It should
call schedule() with __state == TASK_INTERRUPTIBLE|TASK_FREEZABLE, which
is quite sufficient to get it off the runqueue, who then puts it back?
Or is it bailing early in the wait_event loop?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 10:55         ` Christian Borntraeger
  2022-09-26 12:13           ` Peter Zijlstra
@ 2022-09-26 12:32           ` Christian Borntraeger
  2022-09-26 12:55             ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26 12:32 UTC (permalink / raw)
  To: peterz
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 26.09.22 at 12:55, Christian Borntraeger wrote:
> 
> 
> On 26.09.22 at 10:06, Christian Borntraeger wrote:
>>
>>
>> On 23.09.22 at 09:53, Christian Borntraeger wrote:
>>> On 23.09.22 at 09:21, Christian Borntraeger wrote:
>>>> Peter,
>>>>
>>>> as a heads-up. This commit (bisected and verified) triggers a
>>>> regression in our KVM on s390x CI. The symptom is that a specific
>>>> testcase (start a guest with next kernel and a poky ramdisk,
>>>> then ssh via vsock into the guest and run the reboot command) now
>>>> takes much longer (300 instead of 20 seconds). From a first look
>>>> it seems that the sshd takes very long to end during shutdown
>>>> but I have not looked into that yet.
>>>> Any quick idea?
>>>>
>>>> Christian
>>>
>>> the sshd seems to hang in virtio-serial (not vsock).
>>
>> FWIW, sshd does not seem to hang, instead it seems to busy loop in
>> wait_port_writable calling into the scheduler over and over again.
> 
> -#define TASK_FREEZABLE                 0x00002000
> +#define TASK_FREEZABLE                 0x00000000
> 
> "Fixes" the issue. Just have to find out which of users is responsible.

So it seems that my initial test was not good enough.

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 9fa3c76a267f..e93df4f735fe 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -790,7 +790,7 @@ static int wait_port_writable(struct port *port, bool nonblock)
                 if (nonblock)
                         return -EAGAIN;
  
-               ret = wait_event_freezable(port->waitqueue,
+               ret = wait_event_interruptible(port->waitqueue,
                                            !will_write_block(port));
                 if (ret < 0)
                         return ret;

Does fix the problem.
My initial test was the following

--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -790,10 +790,8 @@ static int wait_port_writable(struct port *port, bool nonblock)
                 if (nonblock)
                         return -EAGAIN;
  
-               ret = wait_event_freezable(port->waitqueue,
+               wait_event(port->waitqueue,
                                            !will_write_block(port));
-               if (ret < 0)
-                       return ret;
         }
         /* Port got hot-unplugged. */
         if (!port->guest_connected)

and obviously it did not provide an exit path.
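
For context, the difference between the two calls after the rewrite
(simplified; both sleeps are interruptible, only the state handed down to
schedule() differs):

	wait_event_interruptible(wq, cond);	/* TASK_INTERRUPTIBLE */
	wait_event_freezable(wq, cond);		/* TASK_INTERRUPTIBLE |
						 * TASK_FREEZABLE */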

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 12:32           ` Christian Borntraeger
@ 2022-09-26 12:55             ` Peter Zijlstra
  2022-09-26 13:23               ` Christian Borntraeger
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-26 12:55 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization

On Mon, Sep 26, 2022 at 02:32:24PM +0200, Christian Borntraeger wrote:
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 9fa3c76a267f..e93df4f735fe 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -790,7 +790,7 @@ static int wait_port_writable(struct port *port, bool nonblock)
>                 if (nonblock)
>                         return -EAGAIN;
> -               ret = wait_event_freezable(port->waitqueue,
> +               ret = wait_event_interruptible(port->waitqueue,
>                                            !will_write_block(port));
>                 if (ret < 0)
>                         return ret;
> 
> Does fix the problem.

It's almost as if someone does try_to_wake_up(.state = TASK_FREEZABLE)
-- which would be quite insane.

Could you please test with something like the below on? I can boot that
with KVM, but obviously I didn't suffer any weirdness to begin with :/

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e6a6417211f..ef9ccfc3a8c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
+	WARN_ON_ONCE(state & TASK_FREEZABLE);
+
 	preempt_disable();
 	if (p == current) {
 		/*

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 12:55             ` Peter Zijlstra
@ 2022-09-26 13:23               ` Christian Borntraeger
  2022-09-26 13:37                 ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 26.09.22 at 14:55, Peter Zijlstra wrote:

> Could you please test with something like the below on? I can boot that
> with KVM, but obviously I didn't suffer any weirdness to begin with :/
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4e6a6417211f..ef9ccfc3a8c0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>   	unsigned long flags;
>   	int cpu, success = 0;
>   
> +	WARN_ON_ONCE(state & TASK_FREEZABLE);
> +
>   	preempt_disable();
>   	if (p == current) {
>   		/*

Does not seem to trigger.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 13:23               ` Christian Borntraeger
@ 2022-09-26 13:37                 ` Peter Zijlstra
  2022-09-26 13:54                   ` Christian Borntraeger
  2022-09-26 15:49                   ` Christian Borntraeger
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-26 13:37 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization

On Mon, Sep 26, 2022 at 03:23:10PM +0200, Christian Borntraeger wrote:
> On 26.09.22 at 14:55, Peter Zijlstra wrote:
> 
> > Could you please test with something like the below on? I can boot that
> > with KVM, but obviously I didn't suffer any weirdness to begin with :/
> > 
> > ---
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4e6a6417211f..ef9ccfc3a8c0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >   	unsigned long flags;
> >   	int cpu, success = 0;
> > +	WARN_ON_ONCE(state & TASK_FREEZABLE);
> > +
> >   	preempt_disable();
> >   	if (p == current) {
> >   		/*
> 
> Does not seem to trigger.

Moo -- quite the puzzle this :/ I'll go stare at it more then.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 13:37                 ` Peter Zijlstra
@ 2022-09-26 13:54                   ` Christian Borntraeger
  2022-09-26 15:49                   ` Christian Borntraeger
  1 sibling, 0 replies; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 26.09.22 at 15:37, Peter Zijlstra wrote:
> On Mon, Sep 26, 2022 at 03:23:10PM +0200, Christian Borntraeger wrote:
>> On 26.09.22 at 14:55, Peter Zijlstra wrote:
>>
>>> Could you please test with something like the below on? I can boot that
>>> with KVM, but obviously I didn't suffer any weirdness to begin with :/
>>>
>>> ---
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 4e6a6417211f..ef9ccfc3a8c0 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>    	unsigned long flags;
>>>    	int cpu, success = 0;
>>> +	WARN_ON_ONCE(state & TASK_FREEZABLE);
>>> +
>>>    	preempt_disable();
>>>    	if (p == current) {
>>>    		/*
>>
>> Does not seem to trigger.
> 
> Moo -- quite the puzzle this :/ I'll go stare at it more then.

In the end this is about the sshd process ending (shutting it down).
I can also trigger the problem by sending a SIGTERM, so it's not about
the shutdown itself.
Profiling the guest I see scheduler functions like sched_clock, pick_next_entity,
update_min_vruntime and so on with 100% system time.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 13:37                 ` Peter Zijlstra
  2022-09-26 13:54                   ` Christian Borntraeger
@ 2022-09-26 15:49                   ` Christian Borntraeger
  2022-09-26 18:06                     ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-26 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 26.09.22 at 15:37, Peter Zijlstra wrote:
> On Mon, Sep 26, 2022 at 03:23:10PM +0200, Christian Borntraeger wrote:
>> On 26.09.22 at 14:55, Peter Zijlstra wrote:
>>
>>> Could you please test with something like the below on? I can boot that
>>> with KVM, but obviously I didn't suffer any weirdness to begin with :/
>>>
>>> ---
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 4e6a6417211f..ef9ccfc3a8c0 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>    	unsigned long flags;
>>>    	int cpu, success = 0;
>>> +	WARN_ON_ONCE(state & TASK_FREEZABLE);
>>> +
>>>    	preempt_disable();
>>>    	if (p == current) {
>>>    		/*
>>
>> Does not seem to trigger.
> 
> Moo -- quite the puzzle this :/ I'll go stare at it more then.

Hmm,

#define ___wait_is_interruptible(state)                                         \
         (!__builtin_constant_p(state) ||                                        \
                 state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)          \

That would not trigger when state is also TASK_FREEZABLE, no?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 15:49                   ` Christian Borntraeger
@ 2022-09-26 18:06                     ` Peter Zijlstra
  2022-09-26 18:22                       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-26 18:06 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization

On Mon, Sep 26, 2022 at 05:49:16PM +0200, Christian Borntraeger wrote:

> Hmm,
> 
> #define ___wait_is_interruptible(state)                                         \
>         (!__builtin_constant_p(state) ||                                        \
>                 state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)          \
> 
> That would not trigger when state is also TASK_FREEZABLE, no?

Spot on!

signal_pending_state() writes that as:

	state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)

which is the correct form.

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 14ad8a0e9fac..7f5a51aae0a7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -281,7 +281,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
 
 #define ___wait_is_interruptible(state)						\
 	(!__builtin_constant_p(state) ||					\
-		state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)		\
+	 (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
 
 extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 

Let me go git-grep some to see if there's more similar fail.
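
Worked example with state = TASK_INTERRUPTIBLE | TASK_FREEZABLE (0x2001,
using TASK_FREEZABLE = 0x2000 from the series):

	state == TASK_INTERRUPTIBLE			/* 0x2001 == 0x0001: false */
	state == TASK_KILLABLE				/* 0x2001 == 0x0102: false */
	state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)	/* 0x2001 & 0x0101: 0x0001 */

The old equality form classified the freezable sleep as non-interruptible,
so the wait loop never returned -ERESTARTSYS, while schedule(), which uses
the correct mask form via signal_pending_state(), refused to block with a
signal pending: hence sshd busy-looping in its wait loop.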

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 18:06                     ` Peter Zijlstra
@ 2022-09-26 18:22                       ` Peter Zijlstra
  2022-09-27  5:35                         ` Christian Borntraeger
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-09-26 18:22 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization

On Mon, Sep 26, 2022 at 08:06:46PM +0200, Peter Zijlstra wrote:

> Let me go git-grep some to see if there's more similar fail.

I've ended up with the below...

---
 include/linux/wait.h | 2 +-
 kernel/hung_task.c   | 8 ++++++--
 kernel/sched/core.c  | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 14ad8a0e9fac..7f5a51aae0a7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -281,7 +281,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
 
 #define ___wait_is_interruptible(state)						\
 	(!__builtin_constant_p(state) ||					\
-		state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)		\
+	 (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
 
 extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f1321c03c32a..4a8a713fd67b 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -191,6 +191,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	hung_task_show_lock = false;
 	rcu_read_lock();
 	for_each_process_thread(g, t) {
+		unsigned int state;
+
 		if (!max_count--)
 			goto unlock;
 		if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) {
@@ -198,8 +200,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 				goto unlock;
 			last_break = jiffies;
 		}
-		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
-		if (READ_ONCE(t->__state) == TASK_UNINTERRUPTIBLE)
+		/* skip the TASK_KILLABLE tasks -- these can be killed */
+		state = READ_ONCE(t->__state);
+		if ((state & TASK_UNINTERRUPTIBLE) &&
+		    !(state & TASK_WAKEKILL))
 			check_hung_task(t, timeout);
 	}
  unlock:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1095917ed048..12ee5b98e2c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8885,7 +8885,7 @@ state_filter_match(unsigned long state_filter, struct task_struct *p)
 	 * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
 	 * TASK_KILLABLE).
 	 */
-	if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
+	if (state_filter == TASK_UNINTERRUPTIBLE && state & TASK_NOLOAD)
 		return false;
 
 	return true;
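
The state_filter_match() hunk follows the same ==-vs-mask pattern; worked
through with TASK_NOLOAD = 0x0400 and TASK_FREEZABLE = 0x2000:

	state = TASK_IDLE | TASK_FREEZABLE;	/* 0x2402 */
	state == TASK_IDLE			/* 0x2402 == 0x0402: false */
	state & TASK_NOLOAD			/* 0x2402 & 0x0400: 0x0400 */

so a freezable idle kthread is still skipped when filtering for
TASK_UNINTERRUPTIBLE.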

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-26 18:22                       ` Peter Zijlstra
@ 2022-09-27  5:35                         ` Christian Borntraeger
  2022-09-28  5:44                           ` Christian Borntraeger
  0 siblings, 1 reply; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-27  5:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 26.09.22 at 20:22, Peter Zijlstra wrote:
> On Mon, Sep 26, 2022 at 08:06:46PM +0200, Peter Zijlstra wrote:
> 
>> Let me go git-grep some to see if there's more similar fail.
> 
> I've ended up with the below...

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Kind of scary that nobody else has reported any regression. I guess the freezable variant is just not used widely.
> 
> ---
>   include/linux/wait.h | 2 +-
>   kernel/hung_task.c   | 8 ++++++--
>   kernel/sched/core.c  | 2 +-
>   3 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 14ad8a0e9fac..7f5a51aae0a7 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -281,7 +281,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
>   
>   #define ___wait_is_interruptible(state)						\
>   	(!__builtin_constant_p(state) ||					\
> -		state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)		\
> +	 (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
>   
>   extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
>   
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index f1321c03c32a..4a8a713fd67b 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -191,6 +191,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>   	hung_task_show_lock = false;
>   	rcu_read_lock();
>   	for_each_process_thread(g, t) {
> +		unsigned int state;
> +
>   		if (!max_count--)
>   			goto unlock;
>   		if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) {
> @@ -198,8 +200,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>   				goto unlock;
>   			last_break = jiffies;
>   		}
> -		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
> -		if (READ_ONCE(t->__state) == TASK_UNINTERRUPTIBLE)
> +		/* skip the TASK_KILLABLE tasks -- these can be killed */
> +		state = READ_ONCE(t->__state);
> +		if ((state & TASK_UNINTERRUPTIBLE) &&
> +		    !(state & TASK_WAKEKILL))
>   			check_hung_task(t, timeout);
>   	}
>    unlock:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1095917ed048..12ee5b98e2c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8885,7 +8885,7 @@ state_filter_match(unsigned long state_filter, struct task_struct *p)
>   	 * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
>   	 * TASK_KILLABLE).
>   	 */
> -	if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
> +	if (state_filter == TASK_UNINTERRUPTIBLE && state & TASK_NOLOAD)
>   		return false;
>   
>   	return true;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-09-27  5:35                         ` Christian Borntraeger
@ 2022-09-28  5:44                           ` Christian Borntraeger
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Borntraeger @ 2022-09-28  5:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, dietmar.eggemann, ebiederm, linux-kernel, linux-pm,
	mgorman, mingo, oleg, rjw, rostedt, tj, vincent.guittot, will,
	Marc Hartmayer, Amit Shah, virtualization



On 27.09.22 at 07:35, Christian Borntraeger wrote:
> 
> 
> On 26.09.22 at 20:22, Peter Zijlstra wrote:
>> On Mon, Sep 26, 2022 at 08:06:46PM +0200, Peter Zijlstra wrote:
>>
>>> Let me go git-grep some to see if there's more similar fail.
>>
>> I've ended up with the below...
> 
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> 
> Kind of scary that nobody else has reported any regression. I guess the freezable variant is just not used widely.

Will you queue this fix for next soon?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
                     ` (2 preceding siblings ...)
  2022-09-23  7:21   ` Christian Borntraeger
@ 2022-10-21 17:22   ` Ville Syrjälä
  2022-10-25  4:52     ` Ville Syrjälä
  3 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-10-21 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Mon, Aug 22, 2022 at 01:18:22PM +0200, Peter Zijlstra wrote:
> +#ifdef CONFIG_LOCKDEP
> +	/*
> +	 * It's dangerous to freeze with locks held; there be dragons there.
> +	 */
> +	if (!(state & __TASK_FREEZABLE_UNSAFE))
> +		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> +#endif

We now seem to be hitting this sporadically in the intel gfx CI.

I've spotted it on two machines so far:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12270/shard-tglb7/igt@gem_ctx_isolation@preservation-s3@vcs0.html
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109950v1/shard-snb5/igt@kms_flip@flip-vs-suspend-interruptible@a-vga1.html

Here's the full splat. Looks a bit funny since the
WARN()->printk()->console_lock() itself trips lockdep:

<6>[   59.998117] PM: suspend entry (s2idle)
<6>[   59.999878] Filesystems sync: 0.001 seconds
<6>[   60.000881] Freezing user space processes ... 
<4>[   60.001059] ------------[ cut here ]------------
<4>[   60.001071] ======================================================
<4>[   60.001071] WARNING: possible circular locking dependency detected
<4>[   60.001072] 6.1.0-rc1-CI_DRM_12270-ga9d18ead9885+ #1 Not tainted
<4>[   60.001073] ------------------------------------------------------
<4>[   60.001073] rtcwake/1152 is trying to acquire lock:
<4>[   60.001074] ffffffff82735198 ((console_sem).lock){..-.}-{2:2}, at: down_trylock+0xa/0x30
<4>[   60.001082] 
                  but task is already holding lock:
<4>[   60.001082] ffff888111a708e0 (&p->pi_lock){-.-.}-{2:2}, at: task_call_func+0x34/0xe0
<4>[   60.001088] 
                  which lock already depends on the new lock.

<4>[   60.001089] 
                  the existing dependency chain (in reverse order) is:
<4>[   60.001089] 
                  -> #1 (&p->pi_lock){-.-.}-{2:2}:
<4>[   60.001091]        lock_acquire+0xd3/0x310
<4>[   60.001094]        _raw_spin_lock_irqsave+0x33/0x50
<4>[   60.001097]        try_to_wake_up+0x6b/0x610
<4>[   60.001098]        up+0x3b/0x50
<4>[   60.001099]        __up_console_sem+0x5c/0x70
<4>[   60.001102]        console_unlock+0x1bc/0x1d0
<4>[   60.001104]        do_con_write+0x654/0xa20
<4>[   60.001108]        con_write+0xa/0x20
<4>[   60.001110]        do_output_char+0x119/0x1e0
<4>[   60.001113]        n_tty_write+0x20f/0x490
<4>[   60.001114]        file_tty_write.isra.29+0x17d/0x320
<4>[   60.001117]        do_iter_readv_writev+0xdb/0x140
<4>[   60.001120]        do_iter_write+0x6c/0x1a0
<4>[   60.001121]        vfs_writev+0x97/0x290
<4>[   60.001123]        do_writev+0x63/0x110
<4>[   60.001125]        do_syscall_64+0x37/0x90
<4>[   60.001128]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[   60.001130] 
                  -> #0 ((console_sem).lock){..-.}-{2:2}:
<4>[   60.001131]        validate_chain+0xb3d/0x2000
<4>[   60.001132]        __lock_acquire+0x5a4/0xb70
<4>[   60.001133]        lock_acquire+0xd3/0x310
<4>[   60.001134]        _raw_spin_lock_irqsave+0x33/0x50
<4>[   60.001136]        down_trylock+0xa/0x30
<4>[   60.001137]        __down_trylock_console_sem+0x25/0xb0
<4>[   60.001139]        console_trylock+0xe/0x70
<4>[   60.001140]        vprintk_emit+0x13c/0x380
<4>[   60.001142]        _printk+0x53/0x6e
<4>[   60.001145]        report_bug.cold.2+0x10/0x52
<4>[   60.001147]        handle_bug+0x3f/0x70
<4>[   60.001148]        exc_invalid_op+0x13/0x60
<4>[   60.001150]        asm_exc_invalid_op+0x16/0x20
<4>[   60.001152]        __set_task_frozen+0x58/0x80
<4>[   60.001156]        task_call_func+0xc2/0xe0
<4>[   60.001157]        freeze_task+0x84/0xe0
<4>[   60.001159]        try_to_freeze_tasks+0xac/0x260
<4>[   60.001160]        freeze_processes+0x56/0xb0
<4>[   60.001162]        pm_suspend.cold.7+0x1d9/0x31c
<4>[   60.001164]        state_store+0x7b/0xe0
<4>[   60.001165]        kernfs_fop_write_iter+0x121/0x1c0
<4>[   60.001169]        vfs_write+0x34c/0x4e0
<4>[   60.001170]        ksys_write+0x57/0xd0
<4>[   60.001172]        do_syscall_64+0x37/0x90
<4>[   60.001174]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[   60.001176] 
                  other info that might help us debug this:

<4>[   60.001176]  Possible unsafe locking scenario:

<4>[   60.001176]        CPU0                    CPU1
<4>[   60.001176]        ----                    ----
<4>[   60.001177]   lock(&p->pi_lock);
<4>[   60.001177]                                lock((console_sem).lock);
<4>[   60.001178]                                lock(&p->pi_lock);
<4>[   60.001179]   lock((console_sem).lock);
<4>[   60.001179] 
                   *** DEADLOCK ***

<4>[   60.001180] 7 locks held by rtcwake/1152:
<4>[   60.001180]  #0: ffff888105e99430 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x57/0xd0
<4>[   60.001184]  #1: ffff88810a048288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xee/0x1c0
<4>[   60.001191]  #2: ffff888100c58538 (kn->active#155){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf7/0x1c0
<4>[   60.001194]  #3: ffffffff8264db08 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend.cold.7+0xfa/0x31c
<4>[   60.001197]  #4: ffffffff82606098 (tasklist_lock){.+.+}-{2:2}, at: try_to_freeze_tasks+0x63/0x260
<4>[   60.001201]  #5: ffffffff8273aed8 (freezer_lock){....}-{2:2}, at: freeze_task+0x27/0xe0
<4>[   60.001204]  #6: ffff888111a708e0 (&p->pi_lock){-.-.}-{2:2}, at: task_call_func+0x34/0xe0
<4>[   60.001207] 
                  stack backtrace:
<4>[   60.001207] CPU: 2 PID: 1152 Comm: rtcwake Not tainted 6.1.0-rc1-CI_DRM_12270-ga9d18ead9885+ #1
<4>[   60.001210] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.3197.A00.2005110542 05/11/2020
<4>[   60.001211] Call Trace:
<4>[   60.001211]  <TASK>
<4>[   60.001212]  dump_stack_lvl+0x56/0x7f
<4>[   60.001215]  check_noncircular+0x132/0x150
<4>[   60.001217]  validate_chain+0xb3d/0x2000
<4>[   60.001220]  __lock_acquire+0x5a4/0xb70
<4>[   60.001222]  lock_acquire+0xd3/0x310
<4>[   60.001223]  ? down_trylock+0xa/0x30
<4>[   60.001226]  ? vprintk_emit+0x13c/0x380
<4>[   60.001228]  _raw_spin_lock_irqsave+0x33/0x50
<4>[   60.001230]  ? down_trylock+0xa/0x30
<4>[   60.001231]  down_trylock+0xa/0x30
<4>[   60.001233]  __down_trylock_console_sem+0x25/0xb0
<4>[   60.001234]  console_trylock+0xe/0x70
<4>[   60.001235]  vprintk_emit+0x13c/0x380
<4>[   60.001237]  _printk+0x53/0x6e
<4>[   60.001240]  ? __set_task_frozen+0x58/0x80
<4>[   60.001241]  report_bug.cold.2+0x10/0x52
<4>[   60.001244]  handle_bug+0x3f/0x70
<4>[   60.001245]  exc_invalid_op+0x13/0x60
<4>[   60.001247]  asm_exc_invalid_op+0x16/0x20
<4>[   60.001250] RIP: 0010:__set_task_frozen+0x58/0x80
<4>[   60.001252] Code: f7 c5 00 20 00 00 74 06 40 f6 c5 03 74 3a 81 e5 00 40 00 00 75 16 8b 15 a2 b9 71 01 85 d2 74 0c 8b 83 60 09 00 00 85 c0 74 02 <0f> 0b c7 43 18 00 80 00 00 b8 00 80 00 00 5b 5d c3 cc cc cc cc 31
<4>[   60.001254] RSP: 0018:ffffc9000335fcf0 EFLAGS: 00010002
<4>[   60.001255] RAX: 0000000000000001 RBX: ffff888111a70040 RCX: 0000000000000000
<4>[   60.001256] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888111a70040
<4>[   60.001257] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000fffffffe
<4>[   60.001258] R10: 0000000001e6f6b9 R11: 00000000934a4c67 R12: 0000000000000246
<4>[   60.001259] R13: ffffffff811653e0 R14: 0000000000000000 R15: ffff888111a70040
<4>[   60.001260]  ? __set_task_special+0x40/0x40
<4>[   60.001263]  task_call_func+0xc2/0xe0
<4>[   60.001265]  freeze_task+0x84/0xe0
<4>[   60.001267]  try_to_freeze_tasks+0xac/0x260
<4>[   60.001270]  freeze_processes+0x56/0xb0
<4>[   60.001272]  pm_suspend.cold.7+0x1d9/0x31c
<4>[   60.001274]  state_store+0x7b/0xe0
<4>[   60.001276]  kernfs_fop_write_iter+0x121/0x1c0
<4>[   60.001278]  vfs_write+0x34c/0x4e0
<4>[   60.001281]  ksys_write+0x57/0xd0
<4>[   60.001284]  do_syscall_64+0x37/0x90
<4>[   60.001285]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[   60.001288] RIP: 0033:0x7fb4705521e7
<4>[   60.001289] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
<4>[   60.001290] RSP: 002b:00007ffe3efac3d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[   60.001291] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb4705521e7
<4>[   60.001292] RDX: 0000000000000004 RSI: 0000559af7969590 RDI: 000000000000000b
<4>[   60.001293] RBP: 0000559af7969590 R08: 0000000000000000 R09: 0000000000000004
<4>[   60.001293] R10: 0000559af60922a6 R11: 0000000000000246 R12: 0000000000000004
<4>[   60.001294] R13: 0000559af7967540 R14: 00007fb47062e4a0 R15: 00007fb47062d8a0
<4>[   60.001296]  </TASK>
<4>[   60.001634] WARNING: CPU: 2 PID: 1152 at kernel/freezer.c:129 __set_task_frozen+0x58/0x80
<4>[   60.001641] Modules linked in: fuse snd_hda_codec_hdmi i915 x86_pkg_temp_thermal coretemp mei_pxp mei_hdcp kvm_intel wmi_bmof snd_hda_intel kvm snd_intel_dspcfg prime_numbers snd_hda_codec cdc_ether ttm e1000e irqbypass snd_hwdep crct10dif_pclmul usbnet drm_buddy crc32_pclmul mii ghash_clmulni_intel snd_hda_core drm_display_helper ptp i2c_i801 pps_core mei_me drm_kms_helper i2c_smbus snd_pcm syscopyarea mei sysfillrect sysimgblt intel_lpss_pci fb_sys_fops video wmi
<4>[   60.001717] CPU: 2 PID: 1152 Comm: rtcwake Not tainted 6.1.0-rc1-CI_DRM_12270-ga9d18ead9885+ #1
<4>[   60.001723] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.3197.A00.2005110542 05/11/2020
<4>[   60.001729] RIP: 0010:__set_task_frozen+0x58/0x80
<4>[   60.001735] Code: f7 c5 00 20 00 00 74 06 40 f6 c5 03 74 3a 81 e5 00 40 00 00 75 16 8b 15 a2 b9 71 01 85 d2 74 0c 8b 83 60 09 00 00 85 c0 74 02 <0f> 0b c7 43 18 00 80 00 00 b8 00 80 00 00 5b 5d c3 cc cc cc cc 31
<4>[   60.001744] RSP: 0018:ffffc9000335fcf0 EFLAGS: 00010002
<4>[   60.001747] RAX: 0000000000000001 RBX: ffff888111a70040 RCX: 0000000000000000
<4>[   60.001751] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888111a70040
<4>[   60.001757] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000fffffffe
<4>[   60.001763] R10: 0000000001e6f6b9 R11: 00000000934a4c67 R12: 0000000000000246
<4>[   60.001769] R13: ffffffff811653e0 R14: 0000000000000000 R15: ffff888111a70040
<4>[   60.001776] FS:  00007fb47043e740(0000) GS:ffff8884a0300000(0000) knlGS:0000000000000000
<4>[   60.001784] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   60.001789] CR2: 00007f3903c603d8 CR3: 000000010a25a003 CR4: 0000000000770ee0
<4>[   60.001795] PKRU: 55555554
<4>[   60.001798] Call Trace:
<4>[   60.001801]  <TASK>
<4>[   60.001804]  task_call_func+0xc2/0xe0
<4>[   60.001809]  freeze_task+0x84/0xe0
<4>[   60.001815]  try_to_freeze_tasks+0xac/0x260
<4>[   60.001821]  freeze_processes+0x56/0xb0
<4>[   60.001826]  pm_suspend.cold.7+0x1d9/0x31c
<4>[   60.001832]  state_store+0x7b/0xe0
<4>[   60.001837]  kernfs_fop_write_iter+0x121/0x1c0
<4>[   60.001843]  vfs_write+0x34c/0x4e0
<4>[   60.001850]  ksys_write+0x57/0xd0
<4>[   60.001855]  do_syscall_64+0x37/0x90
<4>[   60.001860]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[   60.001866] RIP: 0033:0x7fb4705521e7
<4>[   60.001870] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
<4>[   60.001884] RSP: 002b:00007ffe3efac3d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[   60.001892] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb4705521e7
<4>[   60.001898] RDX: 0000000000000004 RSI: 0000559af7969590 RDI: 000000000000000b
<4>[   60.001904] RBP: 0000559af7969590 R08: 0000000000000000 R09: 0000000000000004
<4>[   60.001910] R10: 0000559af60922a6 R11: 0000000000000246 R12: 0000000000000004
<4>[   60.001917] R13: 0000559af7967540 R14: 00007fb47062e4a0 R15: 00007fb47062d8a0
<4>[   60.001925]  </TASK>
<4>[   60.001928] irq event stamp: 8712
<4>[   60.001931] hardirqs last  enabled at (8711): [<ffffffff81b73784>] _raw_spin_unlock_irqrestore+0x54/0x70
<4>[   60.001941] hardirqs last disabled at (8712): [<ffffffff81b7352b>] _raw_spin_lock_irqsave+0x4b/0x50
<4>[   60.001950] softirqs last  enabled at (8348): [<ffffffff81e0031e>] __do_softirq+0x31e/0x48a
<4>[   60.001957] softirqs last disabled at (8341): [<ffffffff810c1b08>] irq_exit_rcu+0xb8/0xe0
<4>[   60.001969] ---[ end trace 0000000000000000 ]---
<4>[   60.003326] (elapsed 0.002 seconds) done.
<6>[   60.003332] OOM killer disabled.
<6>[   60.003334] Freezing remaining freezable tasks ... (elapsed 0.006 seconds) done.
<6>[   60.010062] printk: Suspending console(s) (use no_console_suspend to debug)
<6>[   60.041543] e1000e: EEE TX LPI TIMER: 00000011
<6>[   60.368938] ACPI: EC: interrupt blocked

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-21 17:22   ` Ville Syrjälä
@ 2022-10-25  4:52     ` Ville Syrjälä
  2022-10-25 10:49       ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-10-25  4:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Fri, Oct 21, 2022 at 08:22:41PM +0300, Ville Syrjälä wrote:
> On Mon, Aug 22, 2022 at 01:18:22PM +0200, Peter Zijlstra wrote:
> > +#ifdef CONFIG_LOCKDEP
> > +	/*
> > +	 * It's dangerous to freeze with locks held; there be dragons there.
> > +	 */
> > +	if (!(state & __TASK_FREEZABLE_UNSAFE))
> > +		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > +#endif
> 
> We now seem to be hitting this sporadically in the intel gfx CI.
> 
> I've spotted it on two machines so far:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12270/shard-tglb7/igt@gem_ctx_isolation@preservation-s3@vcs0.html
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109950v1/shard-snb5/igt@kms_flip@flip-vs-suspend-interruptible@a-vga1.html

Sadly no luck in reproducing this locally so far. In the meantime
I added the following patch into our topic/core-for-CI branch in
the hopes of CI stumbling on it again and dumping a bit more data:

--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -125,8 +125,16 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
 	/*
 	 * It's dangerous to freeze with locks held; there be dragons there.
 	 */
-	if (!(state & __TASK_FREEZABLE_UNSAFE))
-		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+	if (!(state & __TASK_FREEZABLE_UNSAFE)) {
+		static bool warned = false;
+
+		if (!warned && debug_locks && p->lockdep_depth) {
+			debug_show_held_locks(p);
+			WARN(1, "%s/%d holding locks while freezing\n",
+			     p->comm, task_pid_nr(p));
+			warned = true;
+		}
+	}
 #endif
 
 	WRITE_ONCE(p->__state, TASK_FROZEN);
-- 
2.37.4

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-25  4:52     ` Ville Syrjälä
@ 2022-10-25 10:49       ` Peter Zijlstra
  2022-10-26 10:32         ` Ville Syrjälä
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-25 10:49 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Tue, Oct 25, 2022 at 07:52:07AM +0300, Ville Syrjälä wrote:
> On Fri, Oct 21, 2022 at 08:22:41PM +0300, Ville Syrjälä wrote:
> > On Mon, Aug 22, 2022 at 01:18:22PM +0200, Peter Zijlstra wrote:
> > > +#ifdef CONFIG_LOCKDEP
> > > +	/*
> > > +	 * It's dangerous to freeze with locks held; there be dragons there.
> > > +	 */
> > > +	if (!(state & __TASK_FREEZABLE_UNSAFE))
> > > +		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > > +#endif
> > 
> > We now seem to be hitting this sporadically in the intel gfx CI.
> > 
> > I've spotted it on two machines so far:
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12270/shard-tglb7/igt@gem_ctx_isolation@preservation-s3@vcs0.html
> > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109950v1/shard-snb5/igt@kms_flip@flip-vs-suspend-interruptible@a-vga1.html
> 
> Sadly no luck in reproducing this locally so far. In the meantime
> I added the following patch into our topic/core-for-CI branch in
> the hopes of CI stumbling on it again and dumping a bit more data:
> 
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -125,8 +125,16 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
>  	/*
>  	 * It's dangerous to freeze with locks held; there be dragons there.
>  	 */
> -	if (!(state & __TASK_FREEZABLE_UNSAFE))
> -		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> +	if (!(state & __TASK_FREEZABLE_UNSAFE)) {
> +		static bool warned = false;
> +
> +		if (!warned && debug_locks && p->lockdep_depth) {
> +			debug_show_held_locks(p);
> +			WARN(1, "%s/%d holding locks while freezing\n",
> +			     p->comm, task_pid_nr(p));
> +			warned = true;
> +		}
> +	}
>  #endif
>  
>  	WRITE_ONCE(p->__state, TASK_FROZEN);

That seems reasonable. But note that this constraint isn't new; the
previous freezer had much the same constraint but perhaps it wasn't
triggered for mysterious raisins. See the previous
try_to_freeze_unsafe() function.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-25 10:49       ` Peter Zijlstra
@ 2022-10-26 10:32         ` Ville Syrjälä
  2022-10-26 11:43           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-10-26 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Tue, Oct 25, 2022 at 12:49:13PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 25, 2022 at 07:52:07AM +0300, Ville Syrjälä wrote:
> > On Fri, Oct 21, 2022 at 08:22:41PM +0300, Ville Syrjälä wrote:
> > > On Mon, Aug 22, 2022 at 01:18:22PM +0200, Peter Zijlstra wrote:
> > > > +#ifdef CONFIG_LOCKDEP
> > > > +	/*
> > > > +	 * It's dangerous to freeze with locks held; there be dragons there.
> > > > +	 */
> > > > +	if (!(state & __TASK_FREEZABLE_UNSAFE))
> > > > +		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > > > +#endif
> > > 
> > > We now seem to be hitting this sporadically in the intel gfx CI.
> > > 
> > > I've spotted it on two machines so far:
> > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12270/shard-tglb7/igt@gem_ctx_isolation@preservation-s3@vcs0.html
> > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109950v1/shard-snb5/igt@kms_flip@flip-vs-suspend-interruptible@a-vga1.html
> > 
> > Sadly no luck in reproducing this locally so far. In the meantime
> > I added the following patch into our topic/core-for-CI branch in
> > the hopes of CI stumbling on it again and dumping a bit more data:
> > 
> > --- a/kernel/freezer.c
> > +++ b/kernel/freezer.c
> > @@ -125,8 +125,16 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
> >  	/*
> >  	 * It's dangerous to freeze with locks held; there be dragons there.
> >  	 */
> > -	if (!(state & __TASK_FREEZABLE_UNSAFE))
> > -		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > +	if (!(state & __TASK_FREEZABLE_UNSAFE)) {
> > +		static bool warned = false;
> > +
> > +		if (!warned && debug_locks && p->lockdep_depth) {
> > +			debug_show_held_locks(p);
> > +			WARN(1, "%s/%d holding locks while freezing\n",
> > +			     p->comm, task_pid_nr(p));
> > +			warned = true;
> > +		}
> > +	}
> >  #endif
> >  
> >  	WRITE_ONCE(p->__state, TASK_FROZEN);
> 
> That seems reasonable. But note that this constraint isn't new; the
> previous freezer had much the same constraint but perhaps it wasn't
> triggered for mysterious raisins. See the previous
> try_to_freeze_unsafe() function.

Looks like we caught one with the extra debugs now.

Short form looks to be this:
<4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
<4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
<4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing

Based on a quick google, that process seems to be some rsyslog thing.


Here's the full splat with the console_lock mess included:
<6>[  355.437502] Freezing user space processes ... 
<4>[  355.437846] 1 lock held by rs:main Q:Reg/359:

<4>[  355.437865] ======================================================
<4>[  355.437866] WARNING: possible circular locking dependency detected
<4>[  355.437867] 6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1 Tainted: G     U            
<4>[  355.437870] ------------------------------------------------------
<4>[  355.437871] rtcwake/6211 is trying to acquire lock:
<4>[  355.437872] ffffffff82735198 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0xa/0x30
<4>[  355.437883] 
                  but task is already holding lock:
<4>[  355.437885] ffff88810d0908e0 (&p->pi_lock){-.-.}-{2:2}, at: task_call_func+0x34/0xe0
<4>[  355.437893] 
                  which lock already depends on the new lock.

<4>[  355.437894] 
                  the existing dependency chain (in reverse order) is:
<4>[  355.437895] 
                  -> #1 (&p->pi_lock){-.-.}-{2:2}:
<4>[  355.437899]        lock_acquire+0xd3/0x310
<4>[  355.437903]        _raw_spin_lock_irqsave+0x33/0x50
<4>[  355.437907]        try_to_wake_up+0x6b/0x610
<4>[  355.437911]        up+0x3b/0x50
<4>[  355.437914]        __up_console_sem+0x5c/0x70
<4>[  355.437917]        console_unlock+0x1bc/0x1d0
<4>[  355.437920]        con_font_op+0x2e2/0x3a0
<4>[  355.437925]        vt_ioctl+0x4f5/0x13b0
<4>[  355.437930]        tty_ioctl+0x233/0x8e0
<4>[  355.437934]        __x64_sys_ioctl+0x71/0xb0
<4>[  355.437938]        do_syscall_64+0x3a/0x90
<4>[  355.437943]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[  355.437948] 
                  -> #0 ((console_sem).lock){-.-.}-{2:2}:
<4>[  355.437952]        validate_chain+0xb3d/0x2000
<4>[  355.437955]        __lock_acquire+0x5a4/0xb70
<4>[  355.437958]        lock_acquire+0xd3/0x310
<4>[  355.437960]        _raw_spin_lock_irqsave+0x33/0x50
<4>[  355.437965]        down_trylock+0xa/0x30
<4>[  355.437968]        __down_trylock_console_sem+0x25/0xb0
<4>[  355.437971]        console_trylock+0xe/0x70
<4>[  355.437974]        vprintk_emit+0x13c/0x380
<4>[  355.437977]        _printk+0x53/0x6e
<4>[  355.437981]        lockdep_print_held_locks+0x5c/0xab
<4>[  355.437985]        __set_task_frozen+0x6d/0xb0
<4>[  355.437989]        task_call_func+0xc4/0xe0
<4>[  355.437993]        freeze_task+0x84/0xe0
<4>[  355.437997]        try_to_freeze_tasks+0xac/0x260
<4>[  355.438001]        freeze_processes+0x56/0xb0
<4>[  355.438005]        pm_suspend.cold.7+0x1d9/0x31c
<4>[  355.438008]        state_store+0x7b/0xe0
<4>[  355.438012]        kernfs_fop_write_iter+0x124/0x1c0
<4>[  355.438016]        vfs_write+0x34f/0x4e0
<4>[  355.438021]        ksys_write+0x57/0xd0
<4>[  355.438025]        do_syscall_64+0x3a/0x90
<4>[  355.438029]        entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[  355.438034] 
                  other info that might help us debug this:

<4>[  355.438035]  Possible unsafe locking scenario:

<4>[  355.438036]        CPU0                    CPU1
<4>[  355.438037]        ----                    ----
<4>[  355.438037]   lock(&p->pi_lock);
<4>[  355.438040]                                lock((console_sem).lock);
<4>[  355.438042]                                lock(&p->pi_lock);
<4>[  355.438044]   lock((console_sem).lock);
<4>[  355.438046] 
                   *** DEADLOCK ***

<4>[  355.438047] 7 locks held by rtcwake/6211:
<4>[  355.438049]  #0: ffff888104d11430 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x57/0xd0
<4>[  355.438058]  #1: ffff88810e6bac88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xee/0x1c0
<4>[  355.438066]  #2: ffff8881001c0538 (kn->active#167){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf7/0x1c0
<4>[  355.438074]  #3: ffffffff8264db08 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend.cold.7+0xfa/0x31c
<4>[  355.438082]  #4: ffffffff82606098 (tasklist_lock){.+.+}-{2:2}, at: try_to_freeze_tasks+0x63/0x260
<4>[  355.438090]  #5: ffffffff8273aed8 (freezer_lock){....}-{2:2}, at: freeze_task+0x27/0xe0
<4>[  355.438098]  #6: ffff88810d0908e0 (&p->pi_lock){-.-.}-{2:2}, at: task_call_func+0x34/0xe0
<4>[  355.438105] 
                  stack backtrace:
<4>[  355.438107] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
<4>[  355.438110] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
<4>[  355.438112] Call Trace:
<4>[  355.438114]  <TASK>
<4>[  355.438116]  dump_stack_lvl+0x56/0x7f
<4>[  355.438121]  check_noncircular+0x132/0x150
<4>[  355.438125]  ? validate_chain+0x247/0x2000
<4>[  355.438131]  validate_chain+0xb3d/0x2000
<4>[  355.438138]  __lock_acquire+0x5a4/0xb70
<4>[  355.438144]  lock_acquire+0xd3/0x310
<4>[  355.438147]  ? down_trylock+0xa/0x30
<4>[  355.438154]  ? vprintk_emit+0x13c/0x380
<4>[  355.438158]  _raw_spin_lock_irqsave+0x33/0x50
<4>[  355.438163]  ? down_trylock+0xa/0x30
<4>[  355.438167]  down_trylock+0xa/0x30
<4>[  355.438171]  __down_trylock_console_sem+0x25/0xb0
<4>[  355.438175]  console_trylock+0xe/0x70
<4>[  355.438178]  vprintk_emit+0x13c/0x380
<4>[  355.438183]  ? __set_task_special+0x40/0x40
<4>[  355.438187]  _printk+0x53/0x6e
<4>[  355.438195]  lockdep_print_held_locks+0x5c/0xab
<4>[  355.438199]  ? __set_task_special+0x40/0x40
<4>[  355.438203]  __set_task_frozen+0x6d/0xb0
<4>[  355.438208]  task_call_func+0xc4/0xe0
<4>[  355.438214]  freeze_task+0x84/0xe0
<4>[  355.438219]  try_to_freeze_tasks+0xac/0x260
<4>[  355.438226]  freeze_processes+0x56/0xb0
<4>[  355.438230]  pm_suspend.cold.7+0x1d9/0x31c
<4>[  355.438235]  state_store+0x7b/0xe0
<4>[  355.438241]  kernfs_fop_write_iter+0x124/0x1c0
<4>[  355.438247]  vfs_write+0x34f/0x4e0
<4>[  355.438255]  ksys_write+0x57/0xd0
<4>[  355.438261]  do_syscall_64+0x3a/0x90
<4>[  355.438266]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[  355.438271] RIP: 0033:0x7fcfa44d80a7
<4>[  355.438275] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
<4>[  355.438278] RSP: 002b:00007ffd72160e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[  355.438281] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007fcfa44d80a7
<4>[  355.438284] RDX: 0000000000000007 RSI: 000055dd45bf4590 RDI: 000000000000000b
<4>[  355.438286] RBP: 000055dd45bf4590 R08: 0000000000000000 R09: 0000000000000007
<4>[  355.438288] R10: 000055dd441d22a6 R11: 0000000000000246 R12: 0000000000000007
<4>[  355.438290] R13: 000055dd45bf2540 R14: 00007fcfa45b34a0 R15: 00007fcfa45b28a0
<4>[  355.438298]  </TASK>
<4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
<4>[  355.438429] ------------[ cut here ]------------
<4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
<4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
<4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
<4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
<4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
<4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
<4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
<4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
<4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
<4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
<4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
<4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
<4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
<4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
<4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
<4>[  355.438586] Call Trace:
<4>[  355.438589]  <TASK>
<4>[  355.438592]  task_call_func+0xc4/0xe0
<4>[  355.438600]  freeze_task+0x84/0xe0
<4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
<4>[  355.438616]  freeze_processes+0x56/0xb0
<4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
<4>[  355.438629]  state_store+0x7b/0xe0
<4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
<4>[  355.438644]  vfs_write+0x34f/0x4e0
<4>[  355.438655]  ksys_write+0x57/0xd0
<4>[  355.438663]  do_syscall_64+0x3a/0x90
<4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4>[  355.438676] RIP: 0033:0x7fcfa44d80a7
<4>[  355.438681] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
<4>[  355.438685] RSP: 002b:00007ffd72160e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[  355.438690] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007fcfa44d80a7
<4>[  355.438695] RDX: 0000000000000007 RSI: 000055dd45bf4590 RDI: 000000000000000b
<4>[  355.438698] RBP: 000055dd45bf4590 R08: 0000000000000000 R09: 0000000000000007
<4>[  355.438702] R10: 000055dd441d22a6 R11: 0000000000000246 R12: 0000000000000007
<4>[  355.438706] R13: 000055dd45bf2540 R14: 00007fcfa45b34a0 R15: 00007fcfa45b28a0
<4>[  355.438716]  </TASK>
<4>[  355.438718] irq event stamp: 7462
<4>[  355.438721] hardirqs last  enabled at (7461): [<ffffffff81b73764>] _raw_spin_unlock_irqrestore+0x54/0x70
<4>[  355.438729] hardirqs last disabled at (7462): [<ffffffff81b7350b>] _raw_spin_lock_irqsave+0x4b/0x50
<4>[  355.438736] softirqs last  enabled at (7322): [<ffffffff81e0031e>] __do_softirq+0x31e/0x48a
<4>[  355.438742] softirqs last disabled at (7313): [<ffffffff810c1b58>] irq_exit_rcu+0xb8/0xe0
<4>[  355.438749] ---[ end trace 0000000000000000 ]---
<4>[  355.440204] (elapsed 0.002 seconds) done.
<6>[  355.440210] OOM killer disabled.
<6>[  355.440212] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-26 10:32         ` Ville Syrjälä
@ 2022-10-26 11:43           ` Peter Zijlstra
  2022-10-26 12:12             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-26 11:43 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Oct 26, 2022 at 01:32:31PM +0300, Ville Syrjälä wrote:
> Short form looks to be this:
> <4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
> <4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
> <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing

> <4>[  355.438429] ------------[ cut here ]------------
> <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> <4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
> <4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
> <4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
> <4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
> <4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
> <4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
> <4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
> <4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
> <4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
> <4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
> <4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
> <4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
> <4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
> <4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
> <4>[  355.438586] Call Trace:
> <4>[  355.438589]  <TASK>
> <4>[  355.438592]  task_call_func+0xc4/0xe0
> <4>[  355.438600]  freeze_task+0x84/0xe0
> <4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
> <4>[  355.438616]  freeze_processes+0x56/0xb0
> <4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
> <4>[  355.438629]  state_store+0x7b/0xe0
> <4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
> <4>[  355.438644]  vfs_write+0x34f/0x4e0
> <4>[  355.438655]  ksys_write+0x57/0xd0
> <4>[  355.438663]  do_syscall_64+0x3a/0x90
> <4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Oh I think I see what's going on.

It's a very narrow race between schedule() and task_call_func().

  CPU0						CPU1

  __schedule()
    rq_lock();
    prev_state = READ_ONCE(prev->__state);
    if (... && prev_state) {
      deactivate_task(rq, prev, ...)
        prev->on_rq = 0;

						task_call_func()
						  raw_spin_lock_irqsave(p->pi_lock);
						  state = READ_ONCE(p->__state);
						  smp_rmb();
						  if (... || p->on_rq) // false!!!
						    rq = __task_rq_lock()

						  ret = func();

    next = pick_next_task();
    rq = context_switch(prev, next)
      prepare_lock_switch()
        spin_release(&__rq_lockp(rq)->dep_map...)



So while the task is on its way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.

Could you please give the below a spin?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..f519f44cd4c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4200,6 +4200,37 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+static bool __task_needs_rq_lock(struct task_struct *p)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	/*
+	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
+	 * the task is blocked. Make sure to check @state since ttwu() can drop
+	 * locks at the end, see ttwu_queue_wakelist().
+	 */
+	if (state == TASK_RUNNING || state == TASK_WAKING)
+		return true;
+
+	/*
+	 * Ensure we load p->on_rq after p->__state, otherwise it would be
+	 * possible to, falsely, observe p->on_rq == 0.
+	 *
+	 * See try_to_wake_up() for a longer comment.
+	 */
+	smp_rmb();
+	if (p->on_rq)
+		return true;
+
+#ifdef CONFIG_SMP
+	smp_rmb();
+	if (p->on_cpu)
+		return true;
+#endif
+
+	return false;
+}
+
 /**
  * task_call_func - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked, can be @current.
@@ -4217,28 +4248,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 int task_call_func(struct task_struct *p, task_call_f func, void *arg)
 {
 	struct rq *rq = NULL;
-	unsigned int state;
 	struct rq_flags rf;
 	int ret;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 
-	state = READ_ONCE(p->__state);
-
-	/*
-	 * Ensure we load p->on_rq after p->__state, otherwise it would be
-	 * possible to, falsely, observe p->on_rq == 0.
-	 *
-	 * See try_to_wake_up() for a longer comment.
-	 */
-	smp_rmb();
-
-	/*
-	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
-	 * the task is blocked. Make sure to check @state since ttwu() can drop
-	 * locks at the end, see ttwu_queue_wakelist().
-	 */
-	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
+	if (__task_needs_rq_lock(p))
 		rq = __task_rq_lock(p, &rf);
 
 	/*

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-26 11:43           ` Peter Zijlstra
@ 2022-10-26 12:12             ` Peter Zijlstra
  2022-10-26 12:14               ` Peter Zijlstra
  2022-10-27  5:58             ` Chen Yu
  2022-10-27 13:09             ` Ville Syrjälä
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-26 12:12 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2022 at 01:32:31PM +0300, Ville Syrjälä wrote:
> > Short form looks to be this:
> > <4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
> > <4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> 
> > <4>[  355.438429] ------------[ cut here ]------------
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> > <4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
> > <4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
> > <4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
> > <4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
> > <4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
> > <4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
> > <4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
> > <4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
> > <4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
> > <4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
> > <4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
> > <4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
> > <4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
> > <4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
> > <4>[  355.438586] Call Trace:
> > <4>[  355.438589]  <TASK>
> > <4>[  355.438592]  task_call_func+0xc4/0xe0
> > <4>[  355.438600]  freeze_task+0x84/0xe0
> > <4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
> > <4>[  355.438616]  freeze_processes+0x56/0xb0
> > <4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
> > <4>[  355.438629]  state_store+0x7b/0xe0
> > <4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
> > <4>[  355.438644]  vfs_write+0x34f/0x4e0
> > <4>[  355.438655]  ksys_write+0x57/0xd0
> > <4>[  355.438663]  do_syscall_64+0x3a/0x90
> > <4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Oh I think I see what's going on.
> 
> It's a very narrow race between schedule() and task_call_func().
> 
>   CPU0						CPU1
> 
>   __schedule()
>     rq_lock();
>     prev_state = READ_ONCE(prev->__state);
>     if (... && prev_state) {
>       deactivate_task(rq, prev, ...)
>         prev->on_rq = 0;
> 
> 						task_call_func()
> 						  raw_spin_lock_irqsave(p->pi_lock);
> 						  state = READ_ONCE(p->__state);
> 						  smp_rmb();
> 						  if (... || p->on_rq) // false!!!
> 						    rq = __task_rq_lock()
> 
> 						  ret = func();
> 
>     next = pick_next_task();
>     rq = context_switch(prev, next)
>       prepare_lock_switch()
>         spin_release(&__rq_lockp(rq)->dep_map...)
> 
> 
> 
> So while the task is on its way out, it still holds rq->lock for a
> little while, and right then task_call_func() comes in and figures it
> doesn't need rq->lock anymore (because the task is already dequeued --
> but still running there) and then the __set_task_frozen() thing observes
> it's holding rq->lock and yells murder.
> 
> Could you please give the below a spin?

Urgh.. that'll narrow the race more, but won't solve it; that
prepare_lock_switch() is after we clear ->on_cpu.

Let me ponder this a wee bit more..

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-26 12:12             ` Peter Zijlstra
@ 2022-10-26 12:14               ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-26 12:14 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Oct 26, 2022 at 02:12:02PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 26, 2022 at 01:32:31PM +0300, Ville Syrjälä wrote:
> > > Short form looks to be this:
> > > <4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
> > > <4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
> > > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> > 
> > > <4>[  355.438429] ------------[ cut here ]------------
> > > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> > > <4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
> > > <4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
> > > <4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
> > > <4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
> > > <4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
> > > <4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
> > > <4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
> > > <4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
> > > <4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
> > > <4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
> > > <4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
> > > <4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
> > > <4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
> > > <4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > <4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
> > > <4>[  355.438586] Call Trace:
> > > <4>[  355.438589]  <TASK>
> > > <4>[  355.438592]  task_call_func+0xc4/0xe0
> > > <4>[  355.438600]  freeze_task+0x84/0xe0
> > > <4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
> > > <4>[  355.438616]  freeze_processes+0x56/0xb0
> > > <4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
> > > <4>[  355.438629]  state_store+0x7b/0xe0
> > > <4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
> > > <4>[  355.438644]  vfs_write+0x34f/0x4e0
> > > <4>[  355.438655]  ksys_write+0x57/0xd0
> > > <4>[  355.438663]  do_syscall_64+0x3a/0x90
> > > <4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > 
> > Oh I think I see what's going on.
> > 
> > It's a very narrow race between schedule() and task_call_func().
> > 
> >   CPU0						CPU1
> > 
> >   __schedule()
> >     rq_lock();
> >     prev_state = READ_ONCE(prev->__state);
> >     if (... && prev_state) {
> >       deactivate_task(rq, prev, ...)
> >         prev->on_rq = 0;
> > 
> > 						task_call_func()
> > 						  raw_spin_lock_irqsave(p->pi_lock);
> > 						  state = READ_ONCE(p->__state);
> > 						  smp_rmb();
> > 						  if (... || p->on_rq) // false!!!
> > 						    rq = __task_rq_lock()
> > 
> > 						  ret = func();
> > 
> >     next = pick_next_task();
> >     rq = context_switch(prev, next)
> >       prepare_lock_switch()
> >         spin_release(&__rq_lockp(rq)->dep_map...)
> > 
> > 
> > 
> > So while the task is on its way out, it still holds rq->lock for a
> > little while, and right then task_call_func() comes in and figures it
> > doesn't need rq->lock anymore (because the task is already dequeued --
> > but still running there) and then the __set_task_frozen() thing observes
> > it's holding rq->lock and yells murder.
> > 
> > Could you please give the below a spin?
> 
> Urgh.. that'll narrow the race more, but won't solve it; that
> prepare_lock_switch() is after we clear ->on_cpu.
> 
> Let me ponder this a wee bit more..

Oh, n/m, I got myself confused, it's prepare_lock_switch() that releases
the lock (from lockdep's pov) and that *IS* before finish_task() which
clears ->on_cpu.

So all well, please test the patch.
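
To make the lockdep ordering concrete, a userspace C11 sketch of the
schedule()-side publication order -- atomics stand in for the kernel's
primitives, and struct task / schedule_out() are illustrative names,
not kernel code:

#include <stdatomic.h>

struct task {
	atomic_int on_rq;
	atomic_int on_cpu;
};

static void schedule_out(struct task *prev)
{
	/* deactivate_task(): dequeued, but prev still runs here. */
	atomic_store_explicit(&prev->on_rq, 0, memory_order_relaxed);

	/*
	 * prepare_lock_switch(): rq->lock is released from lockdep's
	 * point of view at this point, i.e. before ...
	 */

	/*
	 * ... finish_task(): on_cpu is cleared last, with release
	 * semantics (the kernel uses smp_store_release()).
	 */
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
}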

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-26 11:43           ` Peter Zijlstra
  2022-10-26 12:12             ` Peter Zijlstra
@ 2022-10-27  5:58             ` Chen Yu
  2022-10-27  7:39               ` Peter Zijlstra
  2022-10-27 13:09             ` Ville Syrjälä
  2 siblings, 1 reply; 59+ messages in thread
From: Chen Yu @ 2022-10-27  5:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ville Syrjälä,
	rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On 2022-10-26 at 13:43:00 +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2022 at 01:32:31PM +0300, Ville Syrjälä wrote:
> > Short form looks to be this:
> > <4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
> > <4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> 
> > <4>[  355.438429] ------------[ cut here ]------------
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> > <4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
> > <4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
> > <4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
> > <4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
> > <4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
> > <4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
> > <4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
> > <4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
> > <4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
> > <4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
> > <4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
> > <4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
> > <4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
> > <4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
> > <4>[  355.438586] Call Trace:
> > <4>[  355.438589]  <TASK>
> > <4>[  355.438592]  task_call_func+0xc4/0xe0
> > <4>[  355.438600]  freeze_task+0x84/0xe0
> > <4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
> > <4>[  355.438616]  freeze_processes+0x56/0xb0
> > <4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
> > <4>[  355.438629]  state_store+0x7b/0xe0
> > <4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
> > <4>[  355.438644]  vfs_write+0x34f/0x4e0
> > <4>[  355.438655]  ksys_write+0x57/0xd0
> > <4>[  355.438663]  do_syscall_64+0x3a/0x90
> > <4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Oh I think I see what's going on.
> 
> It's a very narrow race between schedule() and task_call_func().
> 
>   CPU0						CPU1
> 
>   __schedule()
>     rq_lock();
>     prev_state = READ_ONCE(prev->__state);
>     if (... && prev_state) {
>       deactivate_task(rq, prev, ...)
>         prev->on_rq = 0;
> 
> 						task_call_func()
> 						  raw_spin_lock_irqsave(p->pi_lock);
> 						  state = READ_ONCE(p->__state);
> 						  smp_rmb();
> 						  if (... || p->on_rq) // false!!!
> 						    rq = __task_rq_lock()
> 
> 						  ret = func();
> 
>     next = pick_next_task();
>     rq = context_switch(prev, next)
>       prepare_lock_switch()
>         spin_release(&__rq_lockp(rq)->dep_map...)
> 
> 
> 
> So while the task is on its way out, it still holds rq->lock for a
> little while, and right then task_call_func() comes in and figures it
> doesn't need rq->lock anymore (because the task is already dequeued --
> but still running there) and then the __set_task_frozen() thing observes
> it's holding rq->lock and yells murder.
> 
> Could you please give the below a spin?
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cb2aa2b54c7a..f519f44cd4c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4200,6 +4200,37 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	return success;
>  }
>  
> +static bool __task_needs_rq_lock(struct task_struct *p)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	/*
> +	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> +	 * the task is blocked. Make sure to check @state since ttwu() can drop
> +	 * locks at the end, see ttwu_queue_wakelist().
> +	 */
> +	if (state == TASK_RUNNING || state == TASK_WAKING)
> +		return true;
> +
> +	/*
> +	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> +	 * possible to, falsely, observe p->on_rq == 0.
> +	 *
> +	 * See try_to_wake_up() for a longer comment.
> +	 */
> +	smp_rmb();
> +	if (p->on_rq)
> +		return true;
> +
> +#ifdef CONFIG_SMP
> +	smp_rmb();
> +	if (p->on_cpu)
> +		return true;
> +#endif
Should we also add a p->on_cpu check to return 0 in __set_task_frozen()?
Otherwise it might still warn that p is holding the lock?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-27  5:58             ` Chen Yu
@ 2022-10-27  7:39               ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-27  7:39 UTC (permalink / raw)
  To: Chen Yu
  Cc: Ville Syrjälä,
	rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Thu, Oct 27, 2022 at 01:58:09PM +0800, Chen Yu wrote:

> > It's a very narrow race between schedule() and task_call_func().
> > 
> >   CPU0						CPU1
> > 
> >   __schedule()
> >     rq_lock();
> >     prev_state = READ_ONCE(prev->__state);
> >     if (... && prev_state) {
> >       deactivate_task(rq, prev, ...)
> >         prev->on_rq = 0;
> > 
> > 						task_call_func()
> > 						  raw_spin_lock_irqsave(p->pi_lock);
> > 						  state = READ_ONCE(p->__state);
> > 						  smp_rmb();
> > 						  if (... || p->on_rq) // false!!!
> > 						    rq = __task_rq_lock()
> > 
> > 						  ret = func();
> > 
> >     next = pick_next_task();
> >     rq = context_switch(prev, next)
> >       prepare_lock_switch()
> >         spin_release(&__rq_lockp(rq)->dep_map...)
> > 
> > 
> > 
> > So while the task is on its way out, it still holds rq->lock for a
> > little while, and right then task_call_func() comes in and figures it
> > doesn't need rq->lock anymore (because the task is already dequeued --
> > but still running there) and then the __set_task_frozen() thing observes
> > it's holding rq->lock and yells murder.
> > 
> > Could you please give the below a spin?
> > 
> > ---
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index cb2aa2b54c7a..f519f44cd4c7 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4200,6 +4200,37 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	return success;
> >  }
> >  
> > +static bool __task_needs_rq_lock(struct task_struct *p)
> > +{
> > +	unsigned int state = READ_ONCE(p->__state);
> > +
> > +	/*
> > +	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> > +	 * the task is blocked. Make sure to check @state since ttwu() can drop
> > +	 * locks at the end, see ttwu_queue_wakelist().
> > +	 */
> > +	if (state == TASK_RUNNING || state == TASK_WAKING)
> > +		return true;
> > +
> > +	/*
> > +	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> > +	 * possible to, falsely, observe p->on_rq == 0.
> > +	 *
> > +	 * See try_to_wake_up() for a longer comment.
> > +	 */
> > +	smp_rmb();
> > +	if (p->on_rq)
> > +		return true;
> > +
> > +#ifdef CONFIG_SMP
> > +	smp_rmb();
> > +	if (p->on_cpu)
> > +		return true;
> > +#endif
> Should we also add a p->on_cpu check to return 0 in __set_task_frozen()?
> Otherwise it might still warn that p is holding the lock?

With this, I don't think __set_task_frozen() should ever see
'p->on_cpu && !p->on_rq'. By forcing task_call_func() to acquire
rq->lock, that window is closed. That is, this window only exists in
__schedule() while it holds rq->lock; since we're now serializing
against that, we should no longer observe it.
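
To make that concrete, a rough sketch of the fixed interleaving
(abbreviated; not the exact call chain):

  CPU0						CPU1

  __schedule()
    rq_lock();
    prev->on_rq = 0;
						task_call_func()
						  raw_spin_lock_irqsave(p->pi_lock);
						  __task_needs_rq_lock()
						    p->on_rq == 0, but
						    p->on_cpu == 1 -> true
						  __task_rq_lock() // spins...
    context_switch(prev, next)
      finish_task(prev)
        prev->on_cpu = 0;
      finish_lock_switch()
        rq_unlock();
						  // ...acquires rq->lock only
						  // now; on_cpu && !on_rq is
						  // no longer observable
						  ret = func();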

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-26 11:43           ` Peter Zijlstra
  2022-10-26 12:12             ` Peter Zijlstra
  2022-10-27  5:58             ` Chen Yu
@ 2022-10-27 13:09             ` Ville Syrjälä
  2022-10-27 16:53               ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-10-27 13:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 26, 2022 at 01:32:31PM +0300, Ville Syrjälä wrote:
> > Short form looks to be this:
> > <4>[  355.437846] 1 lock held by rs:main Q:Reg/359:
> > <4>[  355.438418]  #0: ffff88844693b758 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x1b/0x30
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> 
> > <4>[  355.438429] ------------[ cut here ]------------
> > <4>[  355.438432] rs:main Q:Reg/359 holding locks while freezing
> > <4>[  355.438439] WARNING: CPU: 0 PID: 6211 at kernel/freezer.c:134 __set_task_frozen+0x86/0xb0
> > <4>[  355.438447] Modules linked in: snd_hda_intel i915 mei_hdcp mei_pxp drm_display_helper drm_kms_helper vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm prime_numbers ttm drm_buddy syscopyarea sysfillrect sysimgblt fb_sys_fops fuse x86_pkg_temp_thermal coretemp kvm_intel btusb btrtl btbcm btintel kvm irqbypass bluetooth crct10dif_pclmul crc32_pclmul ecdh_generic ghash_clmulni_intel ecc e1000e mei_me i2c_i801 ptp mei i2c_smbus pps_core lpc_ich video wmi [last unloaded: drm_kms_helper]
> > <4>[  355.438521] CPU: 0 PID: 6211 Comm: rtcwake Tainted: G     U             6.1.0-rc2-CI_DRM_12295-g3844a56a0922+ #1
> > <4>[  355.438526] Hardware name:  /NUC5i7RYB, BIOS RYBDWi35.86A.0385.2020.0519.1558 05/19/2020
> > <4>[  355.438530] RIP: 0010:__set_task_frozen+0x86/0xb0
> > <4>[  355.438536] Code: 83 60 09 00 00 85 c0 74 2a 48 89 df e8 ac 02 9b 00 8b 93 38 05 00 00 48 8d b3 48 07 00 00 48 c7 c7 a0 62 2b 82 e8 ee c1 9a 00 <0f> 0b c6 05 51 75 e3 02 01 c7 43 18 00 80 00 00 b8 00 80 00 00 5b
> > <4>[  355.438541] RSP: 0018:ffffc900012cbcf0 EFLAGS: 00010086
> > <4>[  355.438546] RAX: 0000000000000000 RBX: ffff88810d090040 RCX: 0000000000000004
> > <4>[  355.438550] RDX: 0000000000000004 RSI: 00000000fffff5de RDI: 00000000ffffffff
> > <4>[  355.438553] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000fffff5de
> > <4>[  355.438557] R10: 00000000002335f8 R11: ffffc900012cbb88 R12: 0000000000000246
> > <4>[  355.438561] R13: ffffffff81165430 R14: 0000000000000000 R15: ffff88810d090040
> > <4>[  355.438565] FS:  00007fcfa43c7740(0000) GS:ffff888446800000(0000) knlGS:0000000000000000
> > <4>[  355.438569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  355.438582] CR2: 00007fceb380f6b8 CR3: 0000000117c5c004 CR4: 00000000003706f0
> > <4>[  355.438586] Call Trace:
> > <4>[  355.438589]  <TASK>
> > <4>[  355.438592]  task_call_func+0xc4/0xe0
> > <4>[  355.438600]  freeze_task+0x84/0xe0
> > <4>[  355.438607]  try_to_freeze_tasks+0xac/0x260
> > <4>[  355.438616]  freeze_processes+0x56/0xb0
> > <4>[  355.438622]  pm_suspend.cold.7+0x1d9/0x31c
> > <4>[  355.438629]  state_store+0x7b/0xe0
> > <4>[  355.438637]  kernfs_fop_write_iter+0x124/0x1c0
> > <4>[  355.438644]  vfs_write+0x34f/0x4e0
> > <4>[  355.438655]  ksys_write+0x57/0xd0
> > <4>[  355.438663]  do_syscall_64+0x3a/0x90
> > <4>[  355.438670]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Oh I think I see what's going on.
> 
> It's a very narrow race between schedule() and task_call_func().
> 
>   CPU0						CPU1
> 
>   __schedule()
>     rq_lock();
>     prev_state = READ_ONCE(prev->__state);
>     if (... && prev_state) {
> >       deactivate_task(rq, prev, ...)
>         prev->on_rq = 0;
> 
> 						task_call_func()
> 						  raw_spin_lock_irqsave(p->pi_lock);
> 						  state = READ_ONCE(p->__state);
> 						  smp_rmb();
> 						  if (... || p->on_rq) // false!!!
> 						    rq = __task_rq_lock()
> 
> 						  ret = func();
> 
>     next = pick_next_task();
>     rq = context_switch(prev, next)
>       prepare_lock_switch()
>         spin_release(&__rq_lockp(rq)->dep_map...)
> 
> 
> 
> So while the task is on its way out, it still holds rq->lock for a
> little while, and right then task_call_func() comes in and figures it
> doesn't need rq->lock anymore (because the task is already dequeued --
> but still running there) and then the __set_task_frozen() thing observes
> it's holding rq->lock and yells murder.
> 
> Could you please give the below a spin?

Thanks. I've added this to our CI branch. I'll try to keep an eye
on it in the coming days and let you know if anything still trips.
And I'll report back maybe ~middle of next week if we haven't caught
anything by then.

> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cb2aa2b54c7a..f519f44cd4c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4200,6 +4200,37 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	return success;
>  }
>  
> +static bool __task_needs_rq_lock(struct task_struct *p)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	/*
> +	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> +	 * the task is blocked. Make sure to check @state since ttwu() can drop
> +	 * locks at the end, see ttwu_queue_wakelist().
> +	 */
> +	if (state == TASK_RUNNING || state == TASK_WAKING)
> +		return true;
> +
> +	/*
> +	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> +	 * possible to, falsely, observe p->on_rq == 0.
> +	 *
> +	 * See try_to_wake_up() for a longer comment.
> +	 */
> +	smp_rmb();
> +	if (p->on_rq)
> +		return true;
> +
> +#ifdef CONFIG_SMP
> +	smp_rmb();
> +	if (p->on_cpu)
> +		return true;
> +#endif
> +
> +	return false;
> +}
> +
>  /**
>   * task_call_func - Invoke a function on task in fixed state
>   * @p: Process for which the function is to be invoked, can be @current.
> @@ -4217,28 +4248,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  int task_call_func(struct task_struct *p, task_call_f func, void *arg)
>  {
>  	struct rq *rq = NULL;
> -	unsigned int state;
>  	struct rq_flags rf;
>  	int ret;
>  
>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>  
> -	state = READ_ONCE(p->__state);
> -
> -	/*
> -	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> -	 * possible to, falsely, observe p->on_rq == 0.
> -	 *
> -	 * See try_to_wake_up() for a longer comment.
> -	 */
> -	smp_rmb();
> -
> -	/*
> -	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> -	 * the task is blocked. Make sure to check @state since ttwu() can drop
> -	 * locks at the end, see ttwu_queue_wakelist().
> -	 */
> -	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
> +	if (__task_needs_rq_lock(p))
>  		rq = __task_rq_lock(p, &rf);
>  
>  	/*

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-27 13:09             ` Ville Syrjälä
@ 2022-10-27 16:53               ` Peter Zijlstra
  2022-11-02 16:57                 ` Ville Syrjälä
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-10-27 16:53 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Thu, Oct 27, 2022 at 04:09:01PM +0300, Ville Syrjälä wrote:
> On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:

> > Could you please give the below a spin?
> 
> Thanks. I've added this to our CI branch. I'll try to keep an eye
> on it in the coming days and let you know if anything still trips.
> And I'll report back maybe ~middle of next week if we haven't caught
> anything by then.

Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-10-27 16:53               ` Peter Zijlstra
@ 2022-11-02 16:57                 ` Ville Syrjälä
  2022-11-02 22:16                   ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-11-02 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Thu, Oct 27, 2022 at 06:53:23PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 27, 2022 at 04:09:01PM +0300, Ville Syrjälä wrote:
> > On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> 
> > > Could you please give the below a spin?
> > 
> > Thanks. I've added this to our CI branch. I'll try to keep an eye
> > on it in the coming days and let you know if anything still trips.
> > And I'll report back maybe ~middle of next week if we haven't caught
> > anything by then.
> 
> Thanks!

Looks like we haven't caught anything since I put the patch in.
So the fix seems good.

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-11-02 16:57                 ` Ville Syrjälä
@ 2022-11-02 22:16                   ` Peter Zijlstra
  2022-11-07 11:47                     ` Ville Syrjälä
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2022-11-02 22:16 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Nov 02, 2022 at 06:57:51PM +0200, Ville Syrjälä wrote:
> On Thu, Oct 27, 2022 at 06:53:23PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 27, 2022 at 04:09:01PM +0300, Ville Syrjälä wrote:
> > > On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> > 
> > > > Could you please give the below a spin?
> > > 
> > > Thanks. I've added this to our CI branch. I'll try to keep an eye
> > > on it in the coming days and let you know if anything still trips.
> > > And I'll report back maybe ~middle of next week if we haven't caught
> > > anything by then.
> > 
> > Thanks!
> 
> Looks like we haven't caught anything since I put the patch in.
> So the fix seems good.

While writing up the Changelog, it occurred to me it might be possible to
fix this another way; could I bother you to also run the below patch for a
bit?
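
(Relative to the previous patch, the only functional change is in the
CONFIG_SMP leg: instead of

	if (p->on_cpu)
		return true;

it now does

	smp_cond_load_acquire(&p->on_cpu, !VAL);

i.e. it waits for the task to finish __schedule() and then lets
task_call_func() skip rq->lock for the blocked task entirely.)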

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..daff72f00385 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4200,6 +4200,40 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+static bool __task_needs_rq_lock(struct task_struct *p)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	/*
+	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
+	 * the task is blocked. Make sure to check @state since ttwu() can drop
+	 * locks at the end, see ttwu_queue_wakelist().
+	 */
+	if (state == TASK_RUNNING || state == TASK_WAKING)
+		return true;
+
+	/*
+	 * Ensure we load p->on_rq after p->__state, otherwise it would be
+	 * possible to, falsely, observe p->on_rq == 0.
+	 *
+	 * See try_to_wake_up() for a longer comment.
+	 */
+	smp_rmb();
+	if (p->on_rq)
+		return true;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure the task has finished __schedule() and will not be referenced
+	 * anymore. Again, see try_to_wake_up() for a longer comment.
+	 */
+	smp_rmb();
+	smp_cond_load_acquire(&p->on_cpu, !VAL);
+#endif
+
+	return false;
+}
+
 /**
  * task_call_func - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked, can be @current.
@@ -4217,28 +4251,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 int task_call_func(struct task_struct *p, task_call_f func, void *arg)
 {
 	struct rq *rq = NULL;
-	unsigned int state;
 	struct rq_flags rf;
 	int ret;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 
-	state = READ_ONCE(p->__state);
-
-	/*
-	 * Ensure we load p->on_rq after p->__state, otherwise it would be
-	 * possible to, falsely, observe p->on_rq == 0.
-	 *
-	 * See try_to_wake_up() for a longer comment.
-	 */
-	smp_rmb();
-
-	/*
-	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
-	 * the task is blocked. Make sure to check @state since ttwu() can drop
-	 * locks at the end, see ttwu_queue_wakelist().
-	 */
-	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
+	if (__task_needs_rq_lock(p))
 		rq = __task_rq_lock(p, &rf);
 
 	/*

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2022-11-02 22:16                   ` Peter Zijlstra
@ 2022-11-07 11:47                     ` Ville Syrjälä
  2022-11-10 20:27                       ` [Intel-gfx] [PATCH v3 6/6] freezer, sched: " Ville Syrjälä
  0 siblings, 1 reply; 59+ messages in thread
From: Ville Syrjälä @ 2022-11-07 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, ebiederm, bigeasy, Will Deacon, linux-kernel, tj,
	linux-pm, intel-gfx

On Wed, Nov 02, 2022 at 11:16:48PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 02, 2022 at 06:57:51PM +0200, Ville Syrjälä wrote:
> > On Thu, Oct 27, 2022 at 06:53:23PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 27, 2022 at 04:09:01PM +0300, Ville Syrjälä wrote:
> > > > On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > Could you please give the below a spin?
> > > > 
> > > > Thanks. I've added this to our CI branch. I'll try to keep an eye
> > > > on it in the coming days and let you know if anything still trips.
> > > > And I'll report back maybe ~middle of next week if we haven't caught
> > > > anything by then.
> > > 
> > > Thanks!
> > 
> > Looks like we haven't caught anything since I put the patch in.
> > So the fix seems good.
> 
> While writing up the Changelog, it occurred to me it might be possible to
> fix another way, could I bother you to also run the below patch for a
> bit?

I swapped in the new patch to the CI branch. I'll check back
after a few days.

> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cb2aa2b54c7a..daff72f00385 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4200,6 +4200,40 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	return success;
>  }
>  
> +static bool __task_needs_rq_lock(struct task_struct *p)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	/*
> +	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> +	 * the task is blocked. Make sure to check @state since ttwu() can drop
> +	 * locks at the end, see ttwu_queue_wakelist().
> +	 */
> +	if (state == TASK_RUNNING || state == TASK_WAKING)
> +		return true;
> +
> +	/*
> +	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> +	 * possible to, falsely, observe p->on_rq == 0.
> +	 *
> +	 * See try_to_wake_up() for a longer comment.
> +	 */
> +	smp_rmb();
> +	if (p->on_rq)
> +		return true;
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Ensure the task has finished __schedule() and will not be referenced
> +	 * anymore. Again, see try_to_wake_up() for a longer comment.
> +	 */
> +	smp_rmb();
> +	smp_cond_load_acquire(&p->on_cpu, !VAL);
> +#endif
> +
> +	return false;
> +}
> +
>  /**
>   * task_call_func - Invoke a function on task in fixed state
>   * @p: Process for which the function is to be invoked, can be @current.
> @@ -4217,28 +4251,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  int task_call_func(struct task_struct *p, task_call_f func, void *arg)
>  {
>  	struct rq *rq = NULL;
> -	unsigned int state;
>  	struct rq_flags rf;
>  	int ret;
>  
>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>  
> -	state = READ_ONCE(p->__state);
> -
> -	/*
> -	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> -	 * possible to, falsely, observe p->on_rq == 0.
> -	 *
> -	 * See try_to_wake_up() for a longer comment.
> -	 */
> -	smp_rmb();
> -
> -	/*
> -	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> -	 * the task is blocked. Make sure to check @state since ttwu() can drop
> -	 * locks at the end, see ttwu_queue_wakelist().
> -	 */
> -	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
> +	if (__task_needs_rq_lock(p))
>  		rq = __task_rq_lock(p, &rf);
>  
>  	/*

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Intel-gfx] [PATCH v3 6/6] freezer, sched: Rewrite core freezer logic
  2022-11-07 11:47                     ` Ville Syrjälä
@ 2022-11-10 20:27                       ` Ville Syrjälä
  0 siblings, 0 replies; 59+ messages in thread
From: Ville Syrjälä @ 2022-11-10 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-pm, linux-kernel, bigeasy, rjw, oleg, rostedt, mingo,
	mgorman, intel-gfx, tj, Will Deacon, dietmar.eggemann, ebiederm

On Mon, Nov 07, 2022 at 01:47:23PM +0200, Ville Syrjälä wrote:
> On Wed, Nov 02, 2022 at 11:16:48PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 02, 2022 at 06:57:51PM +0200, Ville Syrjälä wrote:
> > > On Thu, Oct 27, 2022 at 06:53:23PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Oct 27, 2022 at 04:09:01PM +0300, Ville Syrjälä wrote:
> > > > > On Wed, Oct 26, 2022 at 01:43:00PM +0200, Peter Zijlstra wrote:
> > > > 
> > > > > > Could you please give the below a spin?
> > > > > 
> > > > > Thanks. I've added this to our CI branch. I'll try to keep an eye
> > > > > on it in the coming days and let you know if anything still trips.
> > > > > And I'll report back maybe ~middle of next week if we haven't caught
> > > > > anything by then.
> > > > 
> > > > Thanks!
> > > 
> > > Looks like we haven't caught anything since I put the patch in.
> > > So the fix seems good.
> > 
> > While writing up the Changelog, it occurred to me it might be possible to
> > fix another way, could I bother you to also run the below patch for a
> > bit?
> 
> I swapped in the new patch to the CI branch. I'll check back
> after a few days.

CI hasn't had anything new to report AFAICS, so looks like this
version is good as well.

> 
> > 
> > ---
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index cb2aa2b54c7a..daff72f00385 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4200,6 +4200,40 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	return success;
> >  }
> >  
> > +static bool __task_needs_rq_lock(struct task_struct *p)
> > +{
> > +	unsigned int state = READ_ONCE(p->__state);
> > +
> > +	/*
> > +	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> > +	 * the task is blocked. Make sure to check @state since ttwu() can drop
> > +	 * locks at the end, see ttwu_queue_wakelist().
> > +	 */
> > +	if (state == TASK_RUNNING || state == TASK_WAKING)
> > +		return true;
> > +
> > +	/*
> > +	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> > +	 * possible to, falsely, observe p->on_rq == 0.
> > +	 *
> > +	 * See try_to_wake_up() for a longer comment.
> > +	 */
> > +	smp_rmb();
> > +	if (p->on_rq)
> > +		return true;
> > +
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * Ensure the task has finished __schedule() and will not be referenced
> > +	 * anymore. Again, see try_to_wake_up() for a longer comment.
> > +	 */
> > +	smp_rmb();
> > +	smp_cond_load_acquire(&p->on_cpu, !VAL);
> > +#endif
> > +
> > +	return false;
> > +}
> > +
> >  /**
> >   * task_call_func - Invoke a function on task in fixed state
> >   * @p: Process for which the function is to be invoked, can be @current.
> > @@ -4217,28 +4251,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  int task_call_func(struct task_struct *p, task_call_f func, void *arg)
> >  {
> >  	struct rq *rq = NULL;
> > -	unsigned int state;
> >  	struct rq_flags rf;
> >  	int ret;
> >  
> >  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> >  
> > -	state = READ_ONCE(p->__state);
> > -
> > -	/*
> > -	 * Ensure we load p->on_rq after p->__state, otherwise it would be
> > -	 * possible to, falsely, observe p->on_rq == 0.
> > -	 *
> > -	 * See try_to_wake_up() for a longer comment.
> > -	 */
> > -	smp_rmb();
> > -
> > -	/*
> > -	 * Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
> > -	 * the task is blocked. Make sure to check @state since ttwu() can drop
> > -	 * locks at the end, see ttwu_queue_wakelist().
> > -	 */
> > -	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
> > +	if (__task_needs_rq_lock(p))
> >  		rq = __task_rq_lock(p, &rf);
> >  
> >  	/*
> 
> -- 
> Ville Syrjälä
> Intel

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2021-10-09 10:08 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
@ 2021-10-18 13:36   ` Peter Zijlstra
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-18 13:36 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, Will Deacon
  Cc: linux-kernel, tj, linux-pm

On Sat, Oct 09, 2021 at 12:08:00PM +0200, Peter Zijlstra wrote:

> +static inline unsigned int __can_freeze(struct task_struct *p)
> +{
> +	unsigned int state = READ_ONCE(p->__state);
> +
> +	if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED)))
> +		return 0;
> +
> +	/*
> +	 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they
> +	 * can suffer spurious wakeups.
> +	 */
> +	if (state & TASK_FREEZABLE)
> +		WARN_ON_ONCE(!(state & TASK_NORMAL));
> +
> +#ifdef CONFIG_LOCKDEP
> +	/*
> +	 * It's dangerous to freeze with locks held; there be dragons there.
> +	 */
> +	if (!(state & __TASK_FREEZABLE_UNSAFE))
> +		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> +#endif
> +
> +	return TASK_FROZEN;
> +}
> +
> +/* See task_cond_set_special_state(); serializes against ttwu() */
> +static bool __freeze_task(struct task_struct *p)
> +{
> +	return task_cond_set_special_state(p, __can_freeze(p));
> +}

Will found an issue with this; notably, task_cond_set_special_state()
only takes ->pi_lock and as such doesn't serialize against __schedule(),
which then yields the following fun scenario:


	__schedule()					__freeze_task()


	prev_state = READ_ONCE(prev->__state); // INTERRUPTIBLE

							task_cond_set_special_state()
							  ...
							  WRITE_ONCE(prev->__state, TASK_FROZEN);

	if (signal_pending_state(prev_state, prev)) // SIGPENDING
	  WRITE_ONCE(prev->__state, TASK_RUNNING)




And *whoopsie*, the freezer thinks we're frozen, but we're back in the game.


AFAICT the below, which uses the brand-spanking-new task_call_func()
(currently sitting in tip/sched/core) to also serialize against
rq->lock, should avoid this scenario.
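
In outline (simplified sketch; the real thing lives in tip),
task_call_func() does:

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
		rq = __task_rq_lock(p, &rf);	/* serializes against __schedule() */
	ret = func(p, arg);			/* p->__state is stable here */
	if (rq)
		rq_unlock(rq, &rf);
	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);

so the WRITE_ONCE() of TASK_FROZEN below runs with rq->lock held when
the task is runnable, and cannot race with __schedule()'s
signal_pending_state() re-check.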


--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -233,25 +233,6 @@ struct task_group;
 	} while (0)
 
 /*
- * task_cond_set_special_state() is a cmpxchg like operation on task->state.
- *
- * This operation isn't safe in general and should only be used to transform
- * one (special) blocked state into another, such as:
- *   TASK_STOPPED <-> TASK_FROZEN.
- */
-#define task_cond_set_special_state(task, cond_state)			\
-	({								\
-		struct task_struct *__p = (task);			\
-		unsigned long __flags; /* may shadow */			\
-		unsigned int __state;					\
-		raw_spin_lock_irqsave(&__p->pi_lock, __flags);		\
-		if ((__state = (cond_state)))				\
-			WRITE_ONCE(__p->__state, __state);		\
-		raw_spin_unlock_irqrestore(&__p->pi_lock, __flags);	\
-		!!__state;						\
-	})
-
-/*
  * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
  *
  * RT's spin/rwlock substitutions are state preserving. The state of the
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -101,7 +101,7 @@ static void fake_signal_wake_up(struct t
 	}
 }
 
-static inline unsigned int __can_freeze(struct task_struct *p)
+static int __set_task_frozen(struct task_struct *p, void *arg)
 {
 	unsigned int state = READ_ONCE(p->__state);
 
@@ -123,13 +123,14 @@ static inline unsigned int __can_freeze(
 		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
 #endif
 
+	WRITE_ONCE(p->__state, TASK_FROZEN);
 	return TASK_FROZEN;
 }
 
-/* See task_cond_set_special_state(); serializes against ttwu() */
 static bool __freeze_task(struct task_struct *p)
 {
-	return task_cond_set_special_state(p, __can_freeze(p));
+	/* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */
+	return task_call_func(p, __set_task_frozen, NULL);
 }
 
 /**
@@ -169,7 +170,7 @@ bool freeze_task(struct task_struct *p)
  * reflects that and the below will refuse to restore the special state and
  * instead issue the wakeup.
  */
-static inline unsigned int __thaw_special(struct task_struct *p)
+static int __set_task_special(struct task_struct *p, void *arg)
 {
 	unsigned int state = 0;
 
@@ -188,6 +189,9 @@ static inline unsigned int __thaw_specia
 		state = TASK_STOPPED;
 	}
 
+	if (state)
+		WRITE_ONCE(p->__state, state);
+
 	return state;
 }
 
@@ -200,7 +204,8 @@ void __thaw_task(struct task_struct *p)
 		goto unlock;
 
 	if (lock_task_sighand(p, &flags2)) {
-		bool ret = task_cond_set_special_state(p, __thaw_special(p));
+		/* TASK_FROZEN -> TASK_{STOPPED,TRACED} */
+		bool ret = task_call_func(p, __set_task_special, NULL);
 		unlock_task_sighand(p, &flags2);
 		if (ret)
 			goto unlock;
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,6 +193,17 @@ static bool looks_like_a_spurious_pid(st
 	return true;
 }
 
+static int __set_task_traced(struct task_struct *task, void *arg)
+{
+	unsigned int *state = arg;
+
+	if (!(task->__state & __TASK_TRACED))
+		return 0;
+
+	WRITE_ONCE(task->__state, *state);
+	return *state;
+}
+
 /* Ensure that nothing can wake it up, even SIGKILL */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
@@ -205,10 +216,12 @@ static bool ptrace_freeze_traced(struct
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
+		unsigned int state = __TASK_TRACED;
+
 		task->ptrace &= ~PT_STOPPED_MASK;
 		task->ptrace |= PT_STOPPED;
 		/* *TASK_TRACED -> __TASK_TRACED */
-		task_cond_set_special_state(task, !!(task->__state & __TASK_TRACED) * __TASK_TRACED);
+		task_call_func(task, __set_task_traced, &state);
 		ret = true;
 	}
 	spin_unlock_irq(&task->sighand->siglock);
@@ -233,9 +246,11 @@ static void ptrace_unfreeze_traced(struc
 			task->ptrace &= ~PT_STOPPED_MASK;
 			wake_up_state(task, __TASK_TRACED);
 		} else {
+			unsigned int state = TASK_TRACED;
+
 			task->ptrace |= PT_STOPPED_MASK;
 			/* *TASK_TRACED -> TASK_TRACED */
-			task_cond_set_special_state(task, !!(task->__state & __TASK_TRACED) * TASK_TRACED);
+			task_call_func(task, __set_task_traced, &state);
 		}
 	}
 	spin_unlock_irq(&task->sighand->siglock);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3181,7 +3181,7 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-static inline __wti_match(struct task_struct *p, unsigned int match_state)
+static inline bool __wti_match(struct task_struct *p, unsigned int match_state)
 {
 	unsigned int state = READ_ONCE(p->__state);
 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic
  2021-10-09 10:07 [PATCH v3 0/6] Freezer rewrite Peter Zijlstra
@ 2021-10-09 10:08 ` Peter Zijlstra
  2021-10-18 13:36   ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2021-10-09 10:08 UTC (permalink / raw)
  To: rjw, oleg, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	mgorman, Will Deacon
  Cc: linux-kernel, peterz, tj, linux-pm

This here rewrites the core freezer to behave better wrt thawing. By
replacing PF_FROZEN with TASK_FROZEN, a special block state, it ensures
frozen tasks stay frozen until thawed and don't randomly wake up early,
as is currently possible.

As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay).
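
The bulk of the conversion is mechanical; e.g. (per the kernel/exit.c
hunk below) a freezable wait changes from:

	set_current_state(TASK_UNINTERRUPTIBLE);
	freezable_schedule();

to:

	set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
	schedule();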

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/android/binder.c       |    4 
 drivers/media/pci/pt3/pt3.c    |    4 
 fs/cifs/inode.c                |    4 
 fs/cifs/transport.c            |    5 
 fs/coredump.c                  |    5 
 fs/nfs/file.c                  |    3 
 fs/nfs/inode.c                 |   12 --
 fs/nfs/nfs3proc.c              |    3 
 fs/nfs/nfs4proc.c              |   14 +-
 fs/nfs/nfs4state.c             |    3 
 fs/nfs/pnfs.c                  |    4 
 fs/xfs/xfs_trans_ail.c         |    8 -
 include/linux/completion.h     |    1 
 include/linux/freezer.h        |  244 +----------------------------------------
 include/linux/sched.h          |   41 +++---
 include/linux/sunrpc/sched.h   |    7 -
 include/linux/wait.h           |   40 +++++-
 kernel/cgroup/legacy_freezer.c |   23 +--
 kernel/exit.c                  |    4 
 kernel/fork.c                  |    5 
 kernel/freezer.c               |  132 +++++++++++++++-------
 kernel/futex.c                 |    4 
 kernel/hung_task.c             |    4 
 kernel/power/main.c            |    6 -
 kernel/power/process.c         |   10 -
 kernel/ptrace.c                |    2 
 kernel/sched/completion.c      |    9 +
 kernel/sched/core.c            |   19 ++-
 kernel/signal.c                |   14 +-
 kernel/time/hrtimer.c          |    4 
 kernel/umh.c                   |   20 +--
 mm/khugepaged.c                |    4 
 net/sunrpc/sched.c             |   12 --
 net/unix/af_unix.c             |    8 -
 34 files changed, 274 insertions(+), 408 deletions(-)

--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -3722,10 +3722,9 @@ static int binder_wait_for_work(struct b
 	struct binder_proc *proc = thread->proc;
 	int ret = 0;
 
-	freezer_do_not_count();
 	binder_inner_proc_lock(proc);
 	for (;;) {
-		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		if (binder_has_work_ilocked(thread, do_proc_work))
 			break;
 		if (do_proc_work)
@@ -3742,7 +3741,6 @@ static int binder_wait_for_work(struct b
 	}
 	finish_wait(&thread->wait, &wait);
 	binder_inner_proc_unlock(proc);
-	freezer_count();
 
 	return ret;
 }
--- a/drivers/media/pci/pt3/pt3.c
+++ b/drivers/media/pci/pt3/pt3.c
@@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
 		pt3_proc_dma(adap);
 
 		delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		freezable_schedule_hrtimeout_range(&delay,
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_hrtimeout_range(&delay,
 					PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
 					HRTIMER_MODE_REL);
 	}
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2285,7 +2285,7 @@ cifs_invalidate_mapping(struct inode *in
 static int
 cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -2303,7 +2303,7 @@ cifs_revalidate_mapping(struct inode *in
 		return 0;
 
 	rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
-				     TASK_KILLABLE);
+				     TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (rc)
 		return rc;
 
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -760,8 +760,9 @@ wait_for_response(struct TCP_Server_Info
 {
 	int error;
 
-	error = wait_event_freezekillable_unsafe(server->response_q,
-				    midQ->mid_state != MID_REQUEST_SUBMITTED);
+	error = wait_event_state(server->response_q,
+				 midQ->mid_state != MID_REQUEST_SUBMITTED,
+				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
 	if (error < 0)
 		return -ERESTARTSYS;
 
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -465,9 +465,8 @@ static int coredump_wait(int exit_code,
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
-		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
-		freezer_count();
+		wait_for_completion_state(&core_state->startup,
+					  TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 		/*
 		 * Wait for all the threads to become inactive, so that
 		 * all the thread context (extended register state, like
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -558,7 +558,8 @@ static vm_fault_t nfs_vm_page_mkwrite(st
 	nfs_fscache_wait_on_page_write(NFS_I(inode), page);
 
 	wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
+			   nfs_wait_bit_killable,
+			   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 
 	lock_page(page);
 	mapping = page_file_mapping(page);
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat
 	return nfs_fileid_to_ino_t(fattr->fileid);
 }
 
-static int nfs_wait_killable(int mode)
+int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
 }
-
-int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
-{
-	return nfs_wait_killable(mode);
-}
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
 /**
@@ -1327,7 +1322,8 @@ int nfs_clear_invalid_mapping(struct add
 	 */
 	for (;;) {
 		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
-					 nfs_wait_bit_killable, TASK_KILLABLE);
+					 nfs_wait_bit_killable,
+					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (ret)
 			goto out;
 		spin_lock(&inode->i_lock);
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt,
 		res = rpc_call_sync(clnt, msg, flags);
 		if (res != -EJUKEBOX)
 			break;
-		freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+		__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+		schedule_timeout(NFS_JUKEBOX_RETRY_TIME);
 		res = -ERESTARTSYS;
 	} while (!fatal_signal_pending(current));
 	return res;
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -411,8 +411,8 @@ static int nfs4_delay_killable(long *tim
 {
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
+	__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!__fatal_signal_pending(current))
 		return 0;
 	return -EINTR;
@@ -422,7 +422,8 @@ static int nfs4_delay_interruptible(long
 {
 	might_sleep();
 
-	freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!signal_pending(current))
 		return 0;
 	return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
@@ -7320,7 +7321,8 @@ nfs4_retry_setlk_simple(struct nfs4_stat
 		status = nfs4_proc_setlk(state, cmd, request);
 		if ((status != -EAGAIN) || IS_SETLK(cmd))
 			break;
-		freezable_schedule_timeout_interruptible(timeout);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_timeout(timeout);
 		timeout *= 2;
 		timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
 		status = -ERESTARTSYS;
@@ -7388,10 +7390,8 @@ nfs4_retry_setlk(struct nfs4_state *stat
 			break;
 
 		status = -ERESTARTSYS;
-		freezer_do_not_count();
-		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
+		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 			   NFS4_LOCK_MAXTIMEOUT);
-		freezer_count();
 	} while (!signalled());
 
 	remove_wait_queue(q, &waiter.wait);
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1307,7 +1307,8 @@ int nfs4_wait_clnt_recover(struct nfs_cl
 
 	refcount_inc(&clp->cl_count);
 	res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
-				 nfs_wait_bit_killable, TASK_KILLABLE);
+				 nfs_wait_bit_killable,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (res)
 		goto out;
 	if (clp->cl_cons_state < 0)
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1896,7 +1896,7 @@ static int pnfs_prepare_to_retry_layoutg
 	pnfs_layoutcommit_inode(lo->plh_inode, false);
 	return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
 				   nfs_wait_bit_killable,
-				   TASK_KILLABLE);
+				   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
 
 static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
@@ -3176,7 +3176,7 @@ pnfs_layoutcommit_inode(struct inode *in
 		status = wait_on_bit_lock_action(&nfsi->flags,
 				NFS_INO_LAYOUTCOMMITTING,
 				nfs_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (status)
 			goto out;
 	}
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -590,9 +590,9 @@ xfsaild(
 
 	while (1) {
 		if (tout && tout <= 20)
-			set_current_state(TASK_KILLABLE);
+			set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
 		else
-			set_current_state(TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 		/*
 		 * Check kthread_should_stop() after we set the task state to
@@ -641,14 +641,14 @@ xfsaild(
 		    ailp->ail_target == ailp->ail_target_prev &&
 		    list_empty(&ailp->ail_buf_list)) {
 			spin_unlock(&ailp->ail_lock);
-			freezable_schedule();
+			schedule();
 			tout = 0;
 			continue;
 		}
 		spin_unlock(&ailp->ail_lock);
 
 		if (tout)
-			freezable_schedule_timeout(msecs_to_jiffies(tout));
+			schedule_timeout(msecs_to_jiffies(tout));
 
 		__set_current_state(TASK_RUNNING);
 
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -103,6 +103,7 @@ extern void wait_for_completion(struct c
 extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
+extern int wait_for_completion_state(struct completion *x, unsigned int state);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
 						   unsigned long timeout);
 extern unsigned long wait_for_completion_io_timeout(struct completion *x,
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs
 /*
  * Check if a process has been frozen
  */
-static inline bool frozen(struct task_struct *p)
-{
-	return p->flags & PF_FROZEN;
-}
+extern bool frozen(struct task_struct *p);
 
 extern bool freezing_slow_path(struct task_struct *p);
 
@@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct ta
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
 extern void thaw_kernel_threads(void);
 
-/*
- * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
- * If try_to_freeze causes a lockdep warning it means the caller may deadlock
- */
-static inline bool try_to_freeze_unsafe(void)
+static inline bool try_to_freeze(void)
 {
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
-	return __refrigerator(false);
-}
-
-static inline bool try_to_freeze(void)
-{
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
-	return try_to_freeze_unsafe();
+	return __refrigerator(false);
 }
 
 extern bool freeze_task(struct task_struct *p);
@@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struc
 }
 #endif /* !CONFIG_CGROUP_FREEZER */
 
-/*
- * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
- * calls wait_for_completion(&vfork) and reset right after it returns from this
- * function.  Next, the parent should call try_to_freeze() to freeze itself
- * appropriately in case the child has exited before the freezing of tasks is
- * complete.  However, we don't want kernel threads to be frozen in unexpected
- * places, so we allow them to block freeze_processes() instead or to set
- * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
- * parent won't really block freeze_processes(), since ____call_usermodehelper()
- * (the child) does a little before exec/exit and it can't be frozen before
- * waking up the parent.
- */
-
-
-/**
- * freezer_do_not_count - tell freezer to ignore %current
- *
- * Tell freezers to ignore the current task when determining whether the
- * target frozen state is reached.  IOW, the current task will be
- * considered frozen enough by freezers.
- *
- * The caller shouldn't do anything which isn't allowed for a frozen task
- * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
- * wrap a scheduling operation and nothing much else.
- */
-static inline void freezer_do_not_count(void)
-{
-	current->flags |= PF_FREEZER_SKIP;
-}
-
-/**
- * freezer_count - tell freezer to stop ignoring %current
- *
- * Undo freezer_do_not_count().  It tells freezers that %current should be
- * considered again and tries to freeze if freezing condition is already in
- * effect.
- */
-static inline void freezer_count(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezer_count_unsafe(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	smp_mb();
-	try_to_freeze_unsafe();
-}
-
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
-/*
- * These functions are intended to be used whenever you want allow a sleeping
- * task to be frozen. Note that neither return any clear indication of
- * whether a freeze event happened while in this function.
- */
-
-/* Like schedule(), but should not block the freezer. */
-static inline void freezable_schedule(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezable_schedule_unsafe(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count_unsafe();
-}
-
-/*
- * Like schedule_timeout(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout_interruptible(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
-{
-	long __retval;
-
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/* Like schedule_timeout_killable(), but should not block the freezer. */
-static inline long freezable_schedule_timeout_killable(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/*
- * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
-		u64 delta, const enum hrtimer_mode mode)
-{
-	int __retval;
-	freezer_do_not_count();
-	__retval = schedule_hrtimeout_range(expires, delta, mode);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Freezer-friendly wrappers around wait_event_interruptible(),
- * wait_event_killable() and wait_event_interruptible_timeout(), originally
- * defined in <linux/wait.h>
- */
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-({									\
-	int __retval;							\
-	freezer_do_not_count();						\
-	__retval = wait_event_killable(wq, (condition));		\
-	freezer_count_unsafe();						\
-	__retval;							\
-})
-
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
@@ -281,35 +83,9 @@ static inline void thaw_kernel_threads(v
 
 static inline bool try_to_freeze(void) { return false; }
 
-static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
-#define freezable_schedule()  schedule()
-
-#define freezable_schedule_unsafe()  schedule()
-
-#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
-
-#define freezable_schedule_timeout_interruptible(timeout)		\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_interruptible_unsafe(timeout)	\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_killable(timeout)			\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_timeout_killable_unsafe(timeout)		\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_hrtimeout_range(expires, delta, mode)	\
-	schedule_hrtimeout_range(expires, delta, mode)
-
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-		wait_event_killable(wq, condition)
-
 #endif /* !CONFIG_FREEZER */
 
 #endif	/* FREEZER_H_INCLUDED */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,25 +80,32 @@ struct task_group;
  */
 
 /* Used in tsk->state: */
-#define TASK_RUNNING			0x0000
-#define TASK_INTERRUPTIBLE		0x0001
-#define TASK_UNINTERRUPTIBLE		0x0002
-#define __TASK_STOPPED			0x0004
-#define __TASK_TRACED			0x0008
+#define TASK_RUNNING			0x000000
+#define TASK_INTERRUPTIBLE		0x000001
+#define TASK_UNINTERRUPTIBLE		0x000002
+#define __TASK_STOPPED			0x000004
+#define __TASK_TRACED			0x000008
 /* Used in tsk->exit_state: */
-#define EXIT_DEAD			0x0010
-#define EXIT_ZOMBIE			0x0020
+#define EXIT_DEAD			0x000010
+#define EXIT_ZOMBIE			0x000020
 #define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
 /* Used in tsk->state again: */
-#define TASK_PARKED			0x0040
-#define TASK_DEAD			0x0080
-#define TASK_WAKEKILL			0x0100
-#define TASK_WAKING			0x0200
-#define TASK_NOLOAD			0x0400
-#define TASK_NEW			0x0800
-/* RT specific auxilliary flag to mark RT lock waiters */
-#define TASK_RTLOCK_WAIT		0x1000
-#define TASK_STATE_MAX			0x2000
+#define TASK_PARKED			0x000040
+#define TASK_DEAD			0x000080
+#define TASK_WAKEKILL			0x000100
+#define TASK_WAKING			0x000200
+#define TASK_NOLOAD			0x000400
+#define TASK_NEW			0x000800
+#define TASK_FREEZABLE			0x001000
+#define __TASK_FREEZABLE_UNSAFE	       (0x002000 * IS_ENABLED(CONFIG_LOCKDEP))
+#define TASK_FROZEN			0x004000
+#define TASK_RTLOCK_WAIT		0x008000
+#define TASK_STATE_MAX			0x010000
+
+/*
+ * DO NOT ADD ANY NEW USERS !
+ */
+#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1695,7 +1702,6 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1707,7 +1713,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -265,7 +265,7 @@ int		rpc_malloc(struct rpc_task *);
 void		rpc_free(struct rpc_task *);
 int		rpciod_up(void);
 void		rpciod_down(void);
-int		__rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int		rpc_wait_for_completion_task(struct rpc_task *task);
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 struct net;
 void		rpc_show_tasks(struct net *);
@@ -276,11 +276,6 @@ extern struct workqueue_struct *rpciod_w
 extern struct workqueue_struct *xprtiod_workqueue;
 void		rpc_prepare_task(struct rpc_task *task);
 
-static inline int rpc_wait_for_completion_task(struct rpc_task *task)
-{
-	return __rpc_wait_for_completion_task(task, NULL);
-}
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
 static inline const char * rpc_qname(const struct rpc_wait_queue *q)
 {
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -335,8 +335,8 @@ do {										\
 } while (0)
 
 #define __wait_event_freezable(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-			    freezable_schedule())
+	___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),	\
+			0, 0, schedule())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
@@ -394,8 +394,8 @@ do {										\
 
 #define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
-		      TASK_INTERRUPTIBLE, 0, timeout,				\
-		      __ret = freezable_schedule_timeout(__ret))
+		      (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,		\
+		      __ret = schedule_timeout(__ret))
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
@@ -615,8 +615,8 @@ do {										\
 
 
 #define __wait_event_freezable_exclusive(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,			\
-			freezable_schedule())
+	___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
+			schedule())
 
 #define wait_event_freezable_exclusive(wq, condition)				\
 ({										\
@@ -905,6 +905,34 @@ extern int do_wait_intr_irq(wait_queue_h
 	__ret;									\
 })
 
+#define __wait_event_state(wq, condition, state)				\
+	___wait_event(wq, condition, state, 0, 0, schedule())
+
+/**
+ * wait_event_state - sleep until a condition gets true
+ * @wq_head: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @state: state to sleep in
+ *
+ * The process is put to sleep (@state) until the @condition evaluates to true
+ * or a signal is received.  The @condition is checked each time the waitqueue
+ * @wq_head is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function will return -ERESTARTSYS if it was interrupted by a
+ * signal and 0 if @condition evaluated to true.
+ */
+#define wait_event_state(wq_head, condition, state)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_state(wq_head, condition, state);		\
+	__ret;									\
+})
+
 #define __wait_event_killable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
 		      TASK_KILLABLE, 0, timeout,				\
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -113,7 +113,7 @@ static int freezer_css_online(struct cgr
 
 	if (parent && (parent->state & CGROUP_FREEZING)) {
 		freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 	}
 
 	mutex_unlock(&freezer_mutex);
@@ -134,7 +134,7 @@ static void freezer_css_offline(struct c
 	mutex_lock(&freezer_mutex);
 
 	if (freezer->state & CGROUP_FREEZING)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 
 	freezer->state = 0;
 
@@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup
 			__thaw_task(task);
 		} else {
 			freeze_task(task);
+
 			/* clear FROZEN and propagate upwards */
 			while (freezer && (freezer->state & CGROUP_FROZEN)) {
 				freezer->state &= ~CGROUP_FROZEN;
@@ -271,16 +272,8 @@ static void update_if_frozen(struct cgro
 	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
-		if (freezing(task)) {
-			/*
-			 * freezer_should_skip() indicates that the task
-			 * should be skipped when determining freezing
-			 * completion.  Consider it frozen in addition to
-			 * the usual frozen condition.
-			 */
-			if (!frozen(task) && !freezer_should_skip(task))
-				goto out_iter_end;
-		}
+		if (freezing(task) && !frozen(task))
+			goto out_iter_end;
 	}
 
 	freezer->state |= CGROUP_FROZEN;
@@ -357,7 +350,7 @@ static void freezer_apply_state(struct f
 
 	if (freeze) {
 		if (!(freezer->state & CGROUP_FREEZING))
-			atomic_inc(&system_freezing_cnt);
+			static_branch_inc(&freezer_active);
 		freezer->state |= state;
 		freeze_cgroup(freezer);
 	} else {
@@ -366,9 +359,9 @@ static void freezer_apply_state(struct f
 		freezer->state &= ~state;
 
 		if (!(freezer->state & CGROUP_FREEZING)) {
-			if (was_freezing)
-				atomic_dec(&system_freezing_cnt);
 			freezer->state &= ~CGROUP_FROZEN;
+			if (was_freezing)
+				static_branch_dec(&freezer_active);
 			unfreeze_cgroup(freezer);
 		}
 	}
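
The counting semantics carry over because static keys count their enables; a
rough sketch of the pattern (freezer_active is from this patch, the fast-path
helper mirrors the freezer.h side of the series):

	#include <linux/jump_label.h>

	DEFINE_STATIC_KEY_FALSE(freezer_active);

	/* Fast path compiles to a no-op branch until the key is enabled. */
	static __always_inline bool freezing(struct task_struct *p)
	{
		if (static_branch_unlikely(&freezer_active))
			return freezing_slow_path(p);
		return false;
	}
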
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -469,10 +469,10 @@ static void exit_mm(void)
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
-			freezable_schedule();
+			schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 		mmap_read_lock(mm);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1348,13 +1348,12 @@ static void complete_vfork_done(struct t
 static int wait_for_vfork_done(struct task_struct *child,
 				struct completion *vfork)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE;
 	int killed;
 
-	freezer_do_not_count();
 	cgroup_enter_frozen();
-	killed = wait_for_completion_killable(vfork);
+	killed = wait_for_completion_state(vfork, state);
 	cgroup_leave_frozen(false);
-	freezer_count();
 
 	if (killed) {
 		task_lock(child);
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,10 +13,11 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
-/* indicate whether PM freezing is in effect, protected by
+/*
+ * indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
  */
 bool pm_freezing;
@@ -29,7 +30,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -52,41 +53,40 @@ bool freezing_slow_path(struct task_stru
 }
 EXPORT_SYMBOL(freezing_slow_path);
 
+bool frozen(struct task_struct *p)
+{
+	return READ_ONCE(p->__state) & TASK_FROZEN;
+}
+
 /* Refrigerator is place where frozen processes are stored :-). */
 bool __refrigerator(bool check_kthr_stop)
 {
-	/* Hmm, should we be allowed to suspend when there are realtime
-	   processes around? */
+	unsigned int state = get_current_state();
 	bool was_frozen = false;
-	unsigned int save = get_current_state();
 
 	pr_debug("%s entered refrigerator\n", current->comm);
 
+	WARN_ON_ONCE(state && !(state & TASK_NORMAL));
+
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		bool freeze;
+
+		set_current_state(TASK_FROZEN);
 
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
+		if (!freeze)
 			break;
+
 		was_frozen = true;
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-	/*
-	 * Restore saved task state before returning.  The mb'd version
-	 * needs to be used; otherwise, it might silently break
-	 * synchronization which depends on ordered task state change.
-	 */
-	set_current_state(save);
-
 	return was_frozen;
 }
 EXPORT_SYMBOL(__refrigerator);
@@ -101,6 +101,37 @@ static void fake_signal_wake_up(struct t
 	}
 }
 
+static inline unsigned int __can_freeze(struct task_struct *p)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED)))
+		return 0;
+
+	/*
+	 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they
+	 * can suffer spurious wakeups.
+	 */
+	if (state & TASK_FREEZABLE)
+		WARN_ON_ONCE(!(state & TASK_NORMAL));
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * It's dangerous to freeze with locks held; there be dragons there.
+	 */
+	if (!(state & __TASK_FREEZABLE_UNSAFE))
+		WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+#endif
+
+	return TASK_FROZEN;
+}
+
+/* See task_cond_set_special_state(); serializes against ttwu() */
+static bool __freeze_task(struct task_struct *p)
+{
+	return task_cond_set_special_state(p, __can_freeze(p));
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +147,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -137,19 +156,58 @@ bool freeze_task(struct task_struct *p)
 	if (!(p->flags & PF_KTHREAD))
 		fake_signal_wake_up(p);
 	else
-		wake_up_state(p, TASK_INTERRUPTIBLE);
+		wake_up_state(p, TASK_NORMAL);
 
 	spin_unlock_irqrestore(&freezer_lock, flags);
 	return true;
 }
 
+/*
+ * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical
+ * state in p->jobctl and p->ptrace respectively. If either of them got a
+ * wakeup that was missed because TASK_FROZEN, then their canonical state
+ * reflects that and the below will refuse to restore the special state and
+ * instead issue the wakeup.
+ */
+static inline unsigned int __thaw_special(struct task_struct *p)
+{
+	unsigned int state = 0;
+
+	if (p->ptrace & PT_STOPPED) {
+		state = __TASK_TRACED;
+
+		if (p->ptrace & PT_STOPPED_FATAL) {
+			state |= TASK_WAKEKILL;
+			if (__fatal_signal_pending(p))
+				state = 0;
+		}
+
+	} else if ((p->jobctl & JOBCTL_STOP_PENDING) &&
+		   !__fatal_signal_pending(p)) {
+
+		state = TASK_STOPPED;
+	}
+
+	return state;
+}
+
 void __thaw_task(struct task_struct *p)
 {
-	unsigned long flags;
+	unsigned long flags, flags2;
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
-		wake_up_process(p);
+	if (WARN_ON_ONCE(freezing(p)))
+		goto unlock;
+
+	if (lock_task_sighand(p, &flags2)) {
+		bool ret = task_cond_set_special_state(p, __thaw_special(p));
+		unlock_task_sighand(p, &flags2);
+		if (ret)
+			goto unlock;
+	}
+
+	wake_up_state(p, TASK_FROZEN);
+unlock:
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
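
For kthreads, a freezable wait is now expressed purely in the task state
instead of with freezer_do_not_count()/try_to_freeze() brackets; a
hypothetical loop under the new scheme:

	#include <linux/freezer.h>
	#include <linux/kthread.h>

	static int demo_thread(void *unused)
	{
		set_freezable();	/* clear PF_NOFREEZE */

		while (!kthread_should_stop()) {
			/*
			 * TASK_FREEZABLE lets freeze_task() flip this sleep
			 * to TASK_FROZEN atomically; no try_to_freeze() is
			 * needed on the wakeup path.
			 */
			set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
			if (!kthread_should_stop())
				schedule_timeout(HZ);
			__set_current_state(TASK_RUNNING);

			/* ... do work ... */
		}
		return 0;
	}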
 
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -2832,7 +2832,7 @@ static void futex_wait_queue_me(struct f
 	 * queue_me() calls spin_unlock() upon completion, both serializing
 	 * access to the hash list and forcing another memory barrier.
 	 */
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	queue_me(q, hb);
 
 	/* Arm the timer */
@@ -2850,7 +2850,7 @@ static void futex_wait_queue_me(struct f
 		 * is no timeout, or if it has yet to expire.
 		 */
 		if (!timeout || timeout->task)
-			freezable_schedule();
+			schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 }
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -93,8 +93,8 @@ static void check_hung_task(struct task_
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(READ_ONCE(t->__state) & (TASK_FREEZABLE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -24,7 +24,7 @@
 unsigned int lock_system_sleep(void)
 {
 	unsigned int flags = current->flags;
-	current->flags |= PF_FREEZER_SKIP;
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 	return flags;
 }
@@ -48,8 +48,8 @@ void unlock_system_sleep(unsigned int fl
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	if (!(flags & PF_FREEZER_SKIP))
-		current->flags &= ~PF_FREEZER_SKIP;
+	if (!(flags & PF_NOFREEZE))
+		current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
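
Because lock_system_sleep() now hands the saved flags back, callers stay
balanced even when nested; a hypothetical user:

	unsigned int flags;

	flags = lock_system_sleep();
	/* ... touch suspend/hibernate state under system_transition_mutex ... */
	unlock_system_sleep(flags);
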
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -132,7 +130,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(true);
 	pr_info("Freezing user space processes ... ");
@@ -193,7 +191,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -282,7 +282,7 @@ static int ptrace_check_attach(struct ta
 	read_unlock(&tasklist_lock);
 
 	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) { // XXX mooo!!!
+		if (!wait_task_inactive(child, __TASK_TRACED | TASK_FREEZABLE)) {
 			/*
 			 * This can only happen if may_ptrace_stop() fails and
 			 * ptrace_stop() changes ->state back to TASK_RUNNING,
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -247,6 +247,15 @@ int __sched wait_for_completion_killable
 }
 EXPORT_SYMBOL(wait_for_completion_killable);
 
+int __sched wait_for_completion_state(struct completion *x, unsigned int state)
+{
+	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state);
+	if (t == -ERESTARTSYS)
+		return t;
+	return 0;
+}
+EXPORT_SYMBOL(wait_for_completion_state);
+
 /**
  * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
  * @x:  holds the state of this particular completion
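
A sketch of how a caller composes the state argument for the new helper; the
killable/freezable policy flags are hypothetical, but this mirrors the umh.c
conversion further down in this patch:

	DECLARE_COMPLETION_ONSTACK(done);
	unsigned int state = TASK_UNINTERRUPTIBLE;
	int ret;

	if (killable)
		state |= TASK_KILLABLE;
	if (freezable)
		state |= TASK_FREEZABLE;

	/* 0 on completion, -ERESTARTSYS if @state allowed (fatal) signals */
	ret = wait_for_completion_state(&done, state);
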
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3162,6 +3162,19 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static inline bool __wti_match(struct task_struct *p, unsigned int match_state)
+{
+	unsigned int state = READ_ONCE(p->__state);
+
+	if ((match_state & TASK_FREEZABLE) && state == TASK_FROZEN)
+		return true;
+
+	if (state == (match_state & ~TASK_FREEZABLE))
+		return true;
+
+	return false;
+}
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3206,7 +3219,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_running(rq, p)) {
-			if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
+			if (match_state && __wti_match(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3221,7 +3234,7 @@ unsigned long wait_task_inactive(struct
 		running = task_running(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (!match_state || READ_ONCE(p->__state) == match_state)
+		if (!match_state || __wti_match(p, match_state))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
 
@@ -6154,7 +6167,7 @@ static void __sched notrace __schedule(u
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
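
The __wti_match() helper is what lets a waiter name TASK_FREEZABLE in its
match mask, so a frozen task still counts as matched; the
ptrace_check_attach() conversion above boils down to:

	/* Matches a child in __TASK_TRACED, or one frozen while traced. */
	if (!wait_task_inactive(child, __TASK_TRACED | TASK_FREEZABLE)) {
		/* the tracee raced back to TASK_RUNNING; report failure */
	}
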
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2299,7 +2299,7 @@ static void ptrace_stop(int exit_code, i
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
 		preempt_enable_no_resched();
-		freezable_schedule();
+		schedule();
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2479,7 +2479,7 @@ static bool do_signal_stop(int signr)
 
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
-		freezable_schedule();
+		schedule();
 		return true;
 	} else {
 		/*
@@ -2555,11 +2555,11 @@ static void do_freezer_trap(void)
 	 * immediately (if there is a non-fatal signal pending), and
 	 * put the task into sleep.
 	 */
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	clear_thread_flag(TIF_SIGPENDING);
 	spin_unlock_irq(&current->sighand->siglock);
 	cgroup_enter_frozen();
-	freezable_schedule();
+	schedule();
 }
 
 static int ptrace_signal(int signr, kernel_siginfo_t *info)
@@ -3608,9 +3608,9 @@ static int do_sigtimedwait(const sigset_
 		recalc_sigpending();
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		__set_current_state(TASK_INTERRUPTIBLE);
-		ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
-							 HRTIMER_MODE_REL);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
+					       HRTIMER_MODE_REL);
 		spin_lock_irq(&tsk->sighand->siglock);
 		__set_task_blocked(tsk, &tsk->real_blocked);
 		sigemptyset(&tsk->real_blocked);
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2037,11 +2037,11 @@ static int __sched do_nanosleep(struct h
 	struct restart_block *restart;
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		hrtimer_sleeper_start_expires(t, mode);
 
 		if (likely(t->task))
-			freezable_schedule();
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -404,6 +404,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup)
  */
 int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
 {
+	unsigned int state = TASK_UNINTERRUPTIBLE;
 	DECLARE_COMPLETION_ONSTACK(done);
 	int retval = 0;
 
@@ -437,25 +438,22 @@ int call_usermodehelper_exec(struct subp
 	if (wait == UMH_NO_WAIT)	/* task has freed sub_info */
 		goto unlock;
 
+	if (wait & UMH_KILLABLE)
+		state |= TASK_KILLABLE;
+
 	if (wait & UMH_FREEZABLE)
-		freezer_do_not_count();
+		state |= TASK_FREEZABLE;
 
-	if (wait & UMH_KILLABLE) {
-		retval = wait_for_completion_killable(&done);
-		if (!retval)
-			goto wait_done;
+	retval = wait_for_completion_state(&done, state);
+	if (!retval)
+		goto wait_done;
 
+	if (wait & UMH_KILLABLE) {
 		/* umh_complete() will see NULL and free sub_info */
 		if (xchg(&sub_info->complete, NULL))
 			goto unlock;
-		/* fallthrough, umh_complete() was already called */
 	}
 
-	wait_for_completion(&done);
-
-	if (wait & UMH_FREEZABLE)
-		freezer_count();
-
 wait_done:
 	retval = sub_info->retval;
 out:
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -793,8 +793,8 @@ static void khugepaged_alloc_sleep(void)
 	DEFINE_WAIT(wait);
 
 	add_wait_queue(&khugepaged_wait, &wait);
-	freezable_schedule_timeout_interruptible(
-		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -268,7 +268,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -324,14 +324,12 @@ static int rpc_complete_task(struct rpc_
  * to enforce taking of the wq->lock and hence avoid races with
  * rpc_complete_task().
  */
-int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
+int rpc_wait_for_completion_task(struct rpc_task *task)
 {
-	if (action == NULL)
-		action = rpc_wait_bit_killable;
 	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
-			action, TASK_KILLABLE);
+			rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
-EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
+EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
 
 /*
  * Make an RPC task runnable.
@@ -938,7 +936,7 @@ static void __rpc_execute(struct rpc_tas
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2442,13 +2442,14 @@ static long unix_stream_data_wait(struct
 				  struct sk_buff *last, unsigned int last_len,
 				  bool freezable)
 {
+	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, state);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2461,10 +2462,7 @@ static long unix_stream_data_wait(struct
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		if (freezable)
-			timeo = freezable_schedule_timeout(timeo);
-		else
-			timeo = schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
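
Note: the "freezable * TASK_FREEZABLE" expression above is a branchless way
to conditionally OR in the flag; since @freezable is a bool it multiplies out
to 0 or TASK_FREEZABLE, equivalent to:

	unsigned int state = TASK_INTERRUPTIBLE |
			     (freezable ? TASK_FREEZABLE : 0);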




end of thread, other threads:[~2022-11-10 20:27 UTC | newest]

Thread overview: 59+ messages
2022-08-22 11:18 [PATCH v3 0/6] Freezer Rewrite Peter Zijlstra
2022-08-22 11:18 ` [PATCH v3 1/6] freezer: Have {,un}lock_system_sleep() save/restore flags Peter Zijlstra
2022-08-23 17:25   ` Rafael J. Wysocki
2022-08-22 11:18 ` [PATCH v3 2/6] freezer,umh: Clean up freezer/initrd interaction Peter Zijlstra
2022-08-23 17:28   ` Rafael J. Wysocki
2022-08-22 11:18 ` [PATCH v3 3/6] sched: Change wait_task_inactive()s match_state Peter Zijlstra
2022-09-04 10:44   ` Ingo Molnar
2022-09-06 10:54     ` Peter Zijlstra
2022-09-07  7:23       ` Ingo Molnar
2022-09-07  9:29       ` Peter Zijlstra
2022-09-07  9:30       ` Peter Zijlstra
2022-08-22 11:18 ` [PATCH v3 4/6] sched/completion: Add wait_for_completion_state() Peter Zijlstra
2022-08-23 17:32   ` Rafael J. Wysocki
2022-08-26 21:54     ` Peter Zijlstra
2022-09-04 10:46   ` Ingo Molnar
2022-09-06 10:24     ` Peter Zijlstra
2022-09-07  7:35       ` Ingo Molnar
2022-09-07  9:24         ` Peter Zijlstra
2022-08-22 11:18 ` [PATCH v3 5/6] sched/wait: Add wait_event_state() Peter Zijlstra
2022-09-04  9:54   ` Ingo Molnar
2022-09-06 11:08     ` Peter Zijlstra
2022-09-07  7:26       ` Ingo Molnar
2022-08-22 11:18 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
2022-08-23 17:36   ` Rafael J. Wysocki
2022-09-04 10:09   ` Ingo Molnar
2022-09-06 11:23     ` Peter Zijlstra
2022-09-07  7:30       ` Ingo Molnar
2022-09-23  7:21   ` Christian Borntraeger
2022-09-23  7:53     ` Christian Borntraeger
2022-09-26  8:06       ` Christian Borntraeger
2022-09-26 10:55         ` Christian Borntraeger
2022-09-26 12:13           ` Peter Zijlstra
2022-09-26 12:32           ` Christian Borntraeger
2022-09-26 12:55             ` Peter Zijlstra
2022-09-26 13:23               ` Christian Borntraeger
2022-09-26 13:37                 ` Peter Zijlstra
2022-09-26 13:54                   ` Christian Borntraeger
2022-09-26 15:49                   ` Christian Borntraeger
2022-09-26 18:06                     ` Peter Zijlstra
2022-09-26 18:22                       ` Peter Zijlstra
2022-09-27  5:35                         ` Christian Borntraeger
2022-09-28  5:44                           ` Christian Borntraeger
2022-10-21 17:22   ` Ville Syrjälä
2022-10-25  4:52     ` Ville Syrjälä
2022-10-25 10:49       ` Peter Zijlstra
2022-10-26 10:32         ` Ville Syrjälä
2022-10-26 11:43           ` Peter Zijlstra
2022-10-26 12:12             ` Peter Zijlstra
2022-10-26 12:14               ` Peter Zijlstra
2022-10-27  5:58             ` Chen Yu
2022-10-27  7:39               ` Peter Zijlstra
2022-10-27 13:09             ` Ville Syrjälä
2022-10-27 16:53               ` Peter Zijlstra
2022-11-02 16:57                 ` Ville Syrjälä
2022-11-02 22:16                   ` Peter Zijlstra
2022-11-07 11:47                     ` Ville Syrjälä
2022-11-10 20:27                       ` [Intel-gfx] [PATCH v3 6/6] freezer, sched: " Ville Syrjälä
  -- strict thread matches above, loose matches on Subject: below --
2021-10-09 10:07 [PATCH v3 0/6] Freezer rewrite Peter Zijlstra
2021-10-09 10:08 ` [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic Peter Zijlstra
2021-10-18 13:36   ` Peter Zijlstra
