linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems
@ 2021-05-25 15:14 Will Deacon
  2021-05-25 15:14 ` [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination Will Deacon
                   ` (21 more replies)
  0 siblings, 22 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Hi all,

Here is v7 of the asymmetric 32-bit support patches that I previously
posted here:

  v1: https://lore.kernel.org/r/20201027215118.27003-1-will@kernel.org
  v2: https://lore.kernel.org/r/20201109213023.15092-1-will@kernel.org
  v3: https://lore.kernel.org/r/20201113093720.21106-1-will@kernel.org
  v4: https://lore.kernel.org/r/20201124155039.13804-1-will@kernel.org
  v5: https://lore.kernel.org/r/20201208132835.6151-1-will@kernel.org
  v6: https://lore.kernel.org/r/20210518094725.7701-1-will@kernel.org

There was also a nice LWN writeup in case you've forgotten what this is
about:

	https://lwn.net/Articles/838339/

Changes since v6 include:

  * Bug fix for a pre-existing scheduler migration bug spotted by
    Valentin (nice work!). I've put this patch at the start.

  * Prevent execve() of a 32-bit program from a 64-bit deadline task if
    the forced affinity would violate admission control

  * Fixed a memory leak and missing rcu_read_lock() spotted by Qais
    (thanks!)

  * Handle race between forcing affinity on execve() and CPU hot-unplug

  * Updated documentation

  * Introduce task_cpu_possible() macro as suggested by Peter Z

  * Added some acks/reviewed-by tags, although I didn't include these
    where patches have changed significantly since last time.

Series now based on v5.13-rc3 to avoid conflicting with arm64 cpucaps
rework merged at -rc2.

Cheers,

Will

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: kernel-team@android.com

--->8

Will Deacon (22):
  sched: Favour predetermined active CPU as migration destination
  arm64: cpuinfo: Split AArch32 registers out into a separate struct
  arm64: Allow mismatched 32-bit EL0 support
  KVM: arm64: Kill 32-bit vCPUs on systems with mismatched EL0 support
  arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs
  arm64: Advertise CPUs capable of running 32-bit applications in sysfs
  sched: Introduce task_cpu_possible_mask() to limit fallback rq
    selection
  cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
  sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  sched: Introduce task_struct::user_cpus_ptr to track requested
    affinity
  sched: Split the guts of sched_setaffinity() into a helper function
  sched: Allow task CPU affinity to be restricted on asymmetric systems
  sched: Introduce task_cpus_dl_admissible() to check proposed affinity
  freezer: Add frozen_or_skipped() helper function
  sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  arm64: Implement task_cpu_possible_mask()
  arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit
    EL0
  arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched
    system
  arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0
  arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores
  Documentation: arm64: describe asymmetric 32-bit support

 .../ABI/testing/sysfs-devices-system-cpu      |   9 +
 .../admin-guide/kernel-parameters.txt         |  11 +
 Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++
 Documentation/arm64/index.rst                 |   1 +
 arch/arm64/include/asm/cpu.h                  |  44 ++-
 arch/arm64/include/asm/cpufeature.h           |   8 +-
 arch/arm64/include/asm/elf.h                  |   6 +-
 arch/arm64/include/asm/mmu_context.h          |  13 +
 arch/arm64/kernel/cpufeature.c                | 227 ++++++++---
 arch/arm64/kernel/cpuinfo.c                   |  53 +--
 arch/arm64/kernel/process.c                   |  44 ++-
 arch/arm64/kvm/arm.c                          |  11 +-
 arch/arm64/tools/cpucaps                      |   3 +-
 include/linux/cpuset.h                        |   3 +-
 include/linux/freezer.h                       |   6 +
 include/linux/mmu_context.h                   |  11 +
 include/linux/sched.h                         |  21 ++
 init/init_task.c                              |   1 +
 kernel/cgroup/cpuset.c                        |  53 ++-
 kernel/fork.c                                 |   2 +
 kernel/freezer.c                              |  10 +-
 kernel/hung_task.c                            |   4 +-
 kernel/sched/core.c                           | 351 ++++++++++++++----
 kernel/sched/sched.h                          |   1 +
 24 files changed, 850 insertions(+), 197 deletions(-)
 create mode 100644 Documentation/arm64/asymmetric-32bit.rst

-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-26 11:14   ` Valentin Schneider
  2021-05-25 15:14 ` [PATCH v7 02/22] arm64: cpuinfo: Split AArch32 registers out into a separate struct Will Deacon
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team, Valentin Schneider

Since commit 6d337eab041d ("sched: Fix migrate_disable() vs
set_cpus_allowed_ptr()"), the migration stopper thread is left to
determine the destination CPU of the running task being migrated, even
though set_cpus_allowed_ptr() already identified a candidate target
earlier on.

Unfortunately, the stopper doesn't check whether or not the new
destination CPU is active or not, so __migrate_task() can leave the task
sitting on a CPU that is outside of its affinity mask, even if the CPU
originally chosen by SCA is still active.

For example, with CONFIG_CPUSET=n:

 $ taskset -pc 0-2 $PID
 # offline CPUs 3-4
 $ taskset -pc 3-5 $PID

Then $PID remains on its current CPU (one of 0-2) and does not get
migrated to CPU 5.

Rework 'struct migration_arg' so that an optional pointer to an affinity
mask can be provided to the stopper, allowing us to respect the
original choice of destination CPU when migrating. Note that there is
still the potential to race with a concurrent CPU hot-unplug of the
destination CPU if the caller does not hold the hotplug lock.

Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/sched/core.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..1702a60d178d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1869,6 +1869,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 struct migration_arg {
 	struct task_struct		*task;
 	int				dest_cpu;
+	const struct cpumask		*dest_mask;
 	struct set_affinity_pending	*pending;
 };
 
@@ -1917,6 +1918,7 @@ static int migration_cpu_stop(void *data)
 	struct set_affinity_pending *pending = arg->pending;
 	struct task_struct *p = arg->task;
 	int dest_cpu = arg->dest_cpu;
+	const struct cpumask *dest_mask = arg->dest_mask;
 	struct rq *rq = this_rq();
 	bool complete = false;
 	struct rq_flags rf;
@@ -1956,12 +1958,8 @@ static int migration_cpu_stop(void *data)
 			complete = true;
 		}
 
-		if (dest_cpu < 0) {
-			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
-				goto out;
-
-			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
-		}
+		if (dest_mask && (cpumask_test_cpu(task_cpu(p), dest_mask)))
+			goto out;
 
 		if (task_on_rq_queued(p))
 			rq = __migrate_task(rq, &rf, p, dest_cpu);
@@ -2249,7 +2247,8 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			init_completion(&my_pending.done);
 			my_pending.arg = (struct migration_arg) {
 				.task = p,
-				.dest_cpu = -1,		/* any */
+				.dest_cpu = dest_cpu,
+				.dest_mask = &p->cpus_mask,
 				.pending = &my_pending,
 			};
 
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 02/22] arm64: cpuinfo: Split AArch32 registers out into a separate struct
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
  2021-05-25 15:14 ` [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 03/22] arm64: Allow mismatched 32-bit EL0 support Will Deacon
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

In preparation for late initialisation of the "sanitised" AArch32 register
state, move the AArch32 registers out of 'struct cpuinfo' and into their
own struct definition.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/cpu.h   | 44 +++++++++++----------
 arch/arm64/kernel/cpufeature.c | 71 ++++++++++++++++++----------------
 arch/arm64/kernel/cpuinfo.c    | 53 +++++++++++++------------
 3 files changed, 89 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
index 7faae6ff3ab4..f4e01aa0f442 100644
--- a/arch/arm64/include/asm/cpu.h
+++ b/arch/arm64/include/asm/cpu.h
@@ -12,26 +12,7 @@
 /*
  * Records attributes of an individual CPU.
  */
-struct cpuinfo_arm64 {
-	struct cpu	cpu;
-	struct kobject	kobj;
-	u32		reg_ctr;
-	u32		reg_cntfrq;
-	u32		reg_dczid;
-	u32		reg_midr;
-	u32		reg_revidr;
-
-	u64		reg_id_aa64dfr0;
-	u64		reg_id_aa64dfr1;
-	u64		reg_id_aa64isar0;
-	u64		reg_id_aa64isar1;
-	u64		reg_id_aa64mmfr0;
-	u64		reg_id_aa64mmfr1;
-	u64		reg_id_aa64mmfr2;
-	u64		reg_id_aa64pfr0;
-	u64		reg_id_aa64pfr1;
-	u64		reg_id_aa64zfr0;
-
+struct cpuinfo_32bit {
 	u32		reg_id_dfr0;
 	u32		reg_id_dfr1;
 	u32		reg_id_isar0;
@@ -54,6 +35,29 @@ struct cpuinfo_arm64 {
 	u32		reg_mvfr0;
 	u32		reg_mvfr1;
 	u32		reg_mvfr2;
+};
+
+struct cpuinfo_arm64 {
+	struct cpu	cpu;
+	struct kobject	kobj;
+	u32		reg_ctr;
+	u32		reg_cntfrq;
+	u32		reg_dczid;
+	u32		reg_midr;
+	u32		reg_revidr;
+
+	u64		reg_id_aa64dfr0;
+	u64		reg_id_aa64dfr1;
+	u64		reg_id_aa64isar0;
+	u64		reg_id_aa64isar1;
+	u64		reg_id_aa64mmfr0;
+	u64		reg_id_aa64mmfr1;
+	u64		reg_id_aa64mmfr2;
+	u64		reg_id_aa64pfr0;
+	u64		reg_id_aa64pfr1;
+	u64		reg_id_aa64zfr0;
+
+	struct cpuinfo_32bit	aarch32;
 
 	/* pseudo-ZCR for recording maximum ZCR_EL1 LEN value: */
 	u64		reg_zcr;
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index efed2830d141..a4db25cd7122 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -863,6 +863,31 @@ static void __init init_cpu_hwcaps_indirect_list(void)
 
 static void __init setup_boot_cpu_capabilities(void);
 
+static void __init init_32bit_cpu_features(struct cpuinfo_32bit *info)
+{
+	init_cpu_ftr_reg(SYS_ID_DFR0_EL1, info->reg_id_dfr0);
+	init_cpu_ftr_reg(SYS_ID_DFR1_EL1, info->reg_id_dfr1);
+	init_cpu_ftr_reg(SYS_ID_ISAR0_EL1, info->reg_id_isar0);
+	init_cpu_ftr_reg(SYS_ID_ISAR1_EL1, info->reg_id_isar1);
+	init_cpu_ftr_reg(SYS_ID_ISAR2_EL1, info->reg_id_isar2);
+	init_cpu_ftr_reg(SYS_ID_ISAR3_EL1, info->reg_id_isar3);
+	init_cpu_ftr_reg(SYS_ID_ISAR4_EL1, info->reg_id_isar4);
+	init_cpu_ftr_reg(SYS_ID_ISAR5_EL1, info->reg_id_isar5);
+	init_cpu_ftr_reg(SYS_ID_ISAR6_EL1, info->reg_id_isar6);
+	init_cpu_ftr_reg(SYS_ID_MMFR0_EL1, info->reg_id_mmfr0);
+	init_cpu_ftr_reg(SYS_ID_MMFR1_EL1, info->reg_id_mmfr1);
+	init_cpu_ftr_reg(SYS_ID_MMFR2_EL1, info->reg_id_mmfr2);
+	init_cpu_ftr_reg(SYS_ID_MMFR3_EL1, info->reg_id_mmfr3);
+	init_cpu_ftr_reg(SYS_ID_MMFR4_EL1, info->reg_id_mmfr4);
+	init_cpu_ftr_reg(SYS_ID_MMFR5_EL1, info->reg_id_mmfr5);
+	init_cpu_ftr_reg(SYS_ID_PFR0_EL1, info->reg_id_pfr0);
+	init_cpu_ftr_reg(SYS_ID_PFR1_EL1, info->reg_id_pfr1);
+	init_cpu_ftr_reg(SYS_ID_PFR2_EL1, info->reg_id_pfr2);
+	init_cpu_ftr_reg(SYS_MVFR0_EL1, info->reg_mvfr0);
+	init_cpu_ftr_reg(SYS_MVFR1_EL1, info->reg_mvfr1);
+	init_cpu_ftr_reg(SYS_MVFR2_EL1, info->reg_mvfr2);
+}
+
 void __init init_cpu_features(struct cpuinfo_arm64 *info)
 {
 	/* Before we start using the tables, make sure it is sorted */
@@ -882,29 +907,8 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
 	init_cpu_ftr_reg(SYS_ID_AA64PFR1_EL1, info->reg_id_aa64pfr1);
 	init_cpu_ftr_reg(SYS_ID_AA64ZFR0_EL1, info->reg_id_aa64zfr0);
 
-	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
-		init_cpu_ftr_reg(SYS_ID_DFR0_EL1, info->reg_id_dfr0);
-		init_cpu_ftr_reg(SYS_ID_DFR1_EL1, info->reg_id_dfr1);
-		init_cpu_ftr_reg(SYS_ID_ISAR0_EL1, info->reg_id_isar0);
-		init_cpu_ftr_reg(SYS_ID_ISAR1_EL1, info->reg_id_isar1);
-		init_cpu_ftr_reg(SYS_ID_ISAR2_EL1, info->reg_id_isar2);
-		init_cpu_ftr_reg(SYS_ID_ISAR3_EL1, info->reg_id_isar3);
-		init_cpu_ftr_reg(SYS_ID_ISAR4_EL1, info->reg_id_isar4);
-		init_cpu_ftr_reg(SYS_ID_ISAR5_EL1, info->reg_id_isar5);
-		init_cpu_ftr_reg(SYS_ID_ISAR6_EL1, info->reg_id_isar6);
-		init_cpu_ftr_reg(SYS_ID_MMFR0_EL1, info->reg_id_mmfr0);
-		init_cpu_ftr_reg(SYS_ID_MMFR1_EL1, info->reg_id_mmfr1);
-		init_cpu_ftr_reg(SYS_ID_MMFR2_EL1, info->reg_id_mmfr2);
-		init_cpu_ftr_reg(SYS_ID_MMFR3_EL1, info->reg_id_mmfr3);
-		init_cpu_ftr_reg(SYS_ID_MMFR4_EL1, info->reg_id_mmfr4);
-		init_cpu_ftr_reg(SYS_ID_MMFR5_EL1, info->reg_id_mmfr5);
-		init_cpu_ftr_reg(SYS_ID_PFR0_EL1, info->reg_id_pfr0);
-		init_cpu_ftr_reg(SYS_ID_PFR1_EL1, info->reg_id_pfr1);
-		init_cpu_ftr_reg(SYS_ID_PFR2_EL1, info->reg_id_pfr2);
-		init_cpu_ftr_reg(SYS_MVFR0_EL1, info->reg_mvfr0);
-		init_cpu_ftr_reg(SYS_MVFR1_EL1, info->reg_mvfr1);
-		init_cpu_ftr_reg(SYS_MVFR2_EL1, info->reg_mvfr2);
-	}
+	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0))
+		init_32bit_cpu_features(&info->aarch32);
 
 	if (id_aa64pfr0_sve(info->reg_id_aa64pfr0)) {
 		init_cpu_ftr_reg(SYS_ZCR_EL1, info->reg_zcr);
@@ -975,20 +979,12 @@ static void relax_cpu_ftr_reg(u32 sys_id, int field)
 	WARN_ON(!ftrp->width);
 }
 
-static int update_32bit_cpu_features(int cpu, struct cpuinfo_arm64 *info,
-				     struct cpuinfo_arm64 *boot)
+static int update_32bit_cpu_features(int cpu, struct cpuinfo_32bit *info,
+				     struct cpuinfo_32bit *boot)
 {
 	int taint = 0;
 	u64 pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
 
-	/*
-	 * If we don't have AArch32 at all then skip the checks entirely
-	 * as the register values may be UNKNOWN and we're not going to be
-	 * using them for anything.
-	 */
-	if (!id_aa64pfr0_32bit_el0(pfr0))
-		return taint;
-
 	/*
 	 * If we don't have AArch32 at EL1, then relax the strictness of
 	 * EL1-dependent register fields to avoid spurious sanity check fails.
@@ -1135,10 +1131,17 @@ void update_cpu_features(int cpu,
 	}
 
 	/*
+	 * If we don't have AArch32 at all then skip the checks entirely
+	 * as the register values may be UNKNOWN and we're not going to be
+	 * using them for anything.
+	 *
 	 * This relies on a sanitised view of the AArch64 ID registers
 	 * (e.g. SYS_ID_AA64PFR0_EL1), so we call it last.
 	 */
-	taint |= update_32bit_cpu_features(cpu, info, boot);
+	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
+		taint |= update_32bit_cpu_features(cpu, &info->aarch32,
+						   &boot->aarch32);
+	}
 
 	/*
 	 * Mismatched CPU features are a recipe for disaster. Don't even
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 51fcf99d5351..264c119a6cae 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -344,6 +344,32 @@ static void cpuinfo_detect_icache_policy(struct cpuinfo_arm64 *info)
 	pr_info("Detected %s I-cache on CPU%d\n", icache_policy_str[l1ip], cpu);
 }
 
+static void __cpuinfo_store_cpu_32bit(struct cpuinfo_32bit *info)
+{
+	info->reg_id_dfr0 = read_cpuid(ID_DFR0_EL1);
+	info->reg_id_dfr1 = read_cpuid(ID_DFR1_EL1);
+	info->reg_id_isar0 = read_cpuid(ID_ISAR0_EL1);
+	info->reg_id_isar1 = read_cpuid(ID_ISAR1_EL1);
+	info->reg_id_isar2 = read_cpuid(ID_ISAR2_EL1);
+	info->reg_id_isar3 = read_cpuid(ID_ISAR3_EL1);
+	info->reg_id_isar4 = read_cpuid(ID_ISAR4_EL1);
+	info->reg_id_isar5 = read_cpuid(ID_ISAR5_EL1);
+	info->reg_id_isar6 = read_cpuid(ID_ISAR6_EL1);
+	info->reg_id_mmfr0 = read_cpuid(ID_MMFR0_EL1);
+	info->reg_id_mmfr1 = read_cpuid(ID_MMFR1_EL1);
+	info->reg_id_mmfr2 = read_cpuid(ID_MMFR2_EL1);
+	info->reg_id_mmfr3 = read_cpuid(ID_MMFR3_EL1);
+	info->reg_id_mmfr4 = read_cpuid(ID_MMFR4_EL1);
+	info->reg_id_mmfr5 = read_cpuid(ID_MMFR5_EL1);
+	info->reg_id_pfr0 = read_cpuid(ID_PFR0_EL1);
+	info->reg_id_pfr1 = read_cpuid(ID_PFR1_EL1);
+	info->reg_id_pfr2 = read_cpuid(ID_PFR2_EL1);
+
+	info->reg_mvfr0 = read_cpuid(MVFR0_EL1);
+	info->reg_mvfr1 = read_cpuid(MVFR1_EL1);
+	info->reg_mvfr2 = read_cpuid(MVFR2_EL1);
+}
+
 static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
 {
 	info->reg_cntfrq = arch_timer_get_cntfrq();
@@ -371,31 +397,8 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
 	info->reg_id_aa64pfr1 = read_cpuid(ID_AA64PFR1_EL1);
 	info->reg_id_aa64zfr0 = read_cpuid(ID_AA64ZFR0_EL1);
 
-	/* Update the 32bit ID registers only if AArch32 is implemented */
-	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
-		info->reg_id_dfr0 = read_cpuid(ID_DFR0_EL1);
-		info->reg_id_dfr1 = read_cpuid(ID_DFR1_EL1);
-		info->reg_id_isar0 = read_cpuid(ID_ISAR0_EL1);
-		info->reg_id_isar1 = read_cpuid(ID_ISAR1_EL1);
-		info->reg_id_isar2 = read_cpuid(ID_ISAR2_EL1);
-		info->reg_id_isar3 = read_cpuid(ID_ISAR3_EL1);
-		info->reg_id_isar4 = read_cpuid(ID_ISAR4_EL1);
-		info->reg_id_isar5 = read_cpuid(ID_ISAR5_EL1);
-		info->reg_id_isar6 = read_cpuid(ID_ISAR6_EL1);
-		info->reg_id_mmfr0 = read_cpuid(ID_MMFR0_EL1);
-		info->reg_id_mmfr1 = read_cpuid(ID_MMFR1_EL1);
-		info->reg_id_mmfr2 = read_cpuid(ID_MMFR2_EL1);
-		info->reg_id_mmfr3 = read_cpuid(ID_MMFR3_EL1);
-		info->reg_id_mmfr4 = read_cpuid(ID_MMFR4_EL1);
-		info->reg_id_mmfr5 = read_cpuid(ID_MMFR5_EL1);
-		info->reg_id_pfr0 = read_cpuid(ID_PFR0_EL1);
-		info->reg_id_pfr1 = read_cpuid(ID_PFR1_EL1);
-		info->reg_id_pfr2 = read_cpuid(ID_PFR2_EL1);
-
-		info->reg_mvfr0 = read_cpuid(MVFR0_EL1);
-		info->reg_mvfr1 = read_cpuid(MVFR1_EL1);
-		info->reg_mvfr2 = read_cpuid(MVFR2_EL1);
-	}
+	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0))
+		__cpuinfo_store_cpu_32bit(&info->aarch32);
 
 	if (IS_ENABLED(CONFIG_ARM64_SVE) &&
 	    id_aa64pfr0_sve(info->reg_id_aa64pfr0))
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 03/22] arm64: Allow mismatched 32-bit EL0 support
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
  2021-05-25 15:14 ` [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination Will Deacon
  2021-05-25 15:14 ` [PATCH v7 02/22] arm64: cpuinfo: Split AArch32 registers out into a separate struct Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 04/22] KVM: arm64: Kill 32-bit vCPUs on systems with mismatched " Will Deacon
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

When confronted with a mixture of CPUs, some of which support 32-bit
applications and others which don't, we quite sensibly treat the system
as 64-bit only for userspace and prevent execve() of 32-bit binaries.

Unfortunately, some crazy folks have decided to build systems like this
with the intention of running 32-bit applications, so relax our
sanitisation logic to continue to advertise 32-bit support to userspace
on these systems and track the real 32-bit capable cores in a cpumask
instead. For now, the default behaviour remains but will be tied to
a command-line option in a later patch.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/cpufeature.h |   8 +-
 arch/arm64/kernel/cpufeature.c      | 114 ++++++++++++++++++++++++----
 arch/arm64/tools/cpucaps            |   3 +-
 3 files changed, 110 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 338840c00e8e..603bf4160cd6 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -630,9 +630,15 @@ static inline bool cpu_supports_mixed_endian_el0(void)
 	return id_aa64mmfr0_mixed_endian_el0(read_cpuid(ID_AA64MMFR0_EL1));
 }
 
+const struct cpumask *system_32bit_el0_cpumask(void);
+DECLARE_STATIC_KEY_FALSE(arm64_mismatched_32bit_el0);
+
 static inline bool system_supports_32bit_el0(void)
 {
-	return cpus_have_const_cap(ARM64_HAS_32BIT_EL0);
+	u64 pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+
+	return static_branch_unlikely(&arm64_mismatched_32bit_el0) ||
+	       id_aa64pfr0_32bit_el0(pfr0);
 }
 
 static inline bool system_supports_4kb_granule(void)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index a4db25cd7122..4194a47de62d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -107,6 +107,24 @@ DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE);
 bool arm64_use_ng_mappings = false;
 EXPORT_SYMBOL(arm64_use_ng_mappings);
 
+/*
+ * Permit PER_LINUX32 and execve() of 32-bit binaries even if not all CPUs
+ * support it?
+ */
+static bool __read_mostly allow_mismatched_32bit_el0;
+
+/*
+ * Static branch enabled only if allow_mismatched_32bit_el0 is set and we have
+ * seen at least one CPU capable of 32-bit EL0.
+ */
+DEFINE_STATIC_KEY_FALSE(arm64_mismatched_32bit_el0);
+
+/*
+ * Mask of CPUs supporting 32-bit EL0.
+ * Only valid if arm64_mismatched_32bit_el0 is enabled.
+ */
+static cpumask_var_t cpu_32bit_el0_mask __cpumask_var_read_mostly;
+
 /*
  * Flag to indicate if we have computed the system wide
  * capabilities based on the boot time active CPUs. This
@@ -767,7 +785,7 @@ static void __init sort_ftr_regs(void)
  * Any bits that are not covered by an arm64_ftr_bits entry are considered
  * RES0 for the system-wide value, and must strictly match.
  */
-static void __init init_cpu_ftr_reg(u32 sys_reg, u64 new)
+static void init_cpu_ftr_reg(u32 sys_reg, u64 new)
 {
 	u64 val = 0;
 	u64 strict_mask = ~0x0ULL;
@@ -863,7 +881,7 @@ static void __init init_cpu_hwcaps_indirect_list(void)
 
 static void __init setup_boot_cpu_capabilities(void);
 
-static void __init init_32bit_cpu_features(struct cpuinfo_32bit *info)
+static void init_32bit_cpu_features(struct cpuinfo_32bit *info)
 {
 	init_cpu_ftr_reg(SYS_ID_DFR0_EL1, info->reg_id_dfr0);
 	init_cpu_ftr_reg(SYS_ID_DFR1_EL1, info->reg_id_dfr1);
@@ -979,6 +997,22 @@ static void relax_cpu_ftr_reg(u32 sys_id, int field)
 	WARN_ON(!ftrp->width);
 }
 
+static void update_mismatched_32bit_el0_cpu_features(struct cpuinfo_arm64 *info,
+						     struct cpuinfo_arm64 *boot)
+{
+	static bool boot_cpu_32bit_regs_overridden = false;
+
+	if (!allow_mismatched_32bit_el0 || boot_cpu_32bit_regs_overridden)
+		return;
+
+	if (id_aa64pfr0_32bit_el0(boot->reg_id_aa64pfr0))
+		return;
+
+	boot->aarch32 = info->aarch32;
+	init_32bit_cpu_features(&boot->aarch32);
+	boot_cpu_32bit_regs_overridden = true;
+}
+
 static int update_32bit_cpu_features(int cpu, struct cpuinfo_32bit *info,
 				     struct cpuinfo_32bit *boot)
 {
@@ -1139,6 +1173,7 @@ void update_cpu_features(int cpu,
 	 * (e.g. SYS_ID_AA64PFR0_EL1), so we call it last.
 	 */
 	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
+		update_mismatched_32bit_el0_cpu_features(info, boot);
 		taint |= update_32bit_cpu_features(cpu, &info->aarch32,
 						   &boot->aarch32);
 	}
@@ -1251,6 +1286,28 @@ has_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
 	return feature_matches(val, entry);
 }
 
+const struct cpumask *system_32bit_el0_cpumask(void)
+{
+	if (!system_supports_32bit_el0())
+		return cpu_none_mask;
+
+	if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+		return cpu_32bit_el0_mask;
+
+	return cpu_possible_mask;
+}
+
+static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
+{
+	if (!has_cpuid_feature(entry, scope))
+		return allow_mismatched_32bit_el0;
+
+	if (scope == SCOPE_SYSTEM)
+		pr_info("detected: 32-bit EL0 Support\n");
+
+	return true;
+}
+
 static bool has_useable_gicv3_cpuif(const struct arm64_cpu_capabilities *entry, int scope)
 {
 	bool has_sre;
@@ -1869,10 +1926,9 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.cpu_enable = cpu_copy_el2regs,
 	},
 	{
-		.desc = "32-bit EL0 Support",
-		.capability = ARM64_HAS_32BIT_EL0,
+		.capability = ARM64_HAS_32BIT_EL0_DO_NOT_USE,
 		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
-		.matches = has_cpuid_feature,
+		.matches = has_32bit_el0,
 		.sys_reg = SYS_ID_AA64PFR0_EL1,
 		.sign = FTR_UNSIGNED,
 		.field_pos = ID_AA64PFR0_EL0_SHIFT,
@@ -2381,7 +2437,7 @@ static const struct arm64_cpu_capabilities compat_elf_hwcaps[] = {
 	{},
 };
 
-static void __init cap_set_elf_hwcap(const struct arm64_cpu_capabilities *cap)
+static void cap_set_elf_hwcap(const struct arm64_cpu_capabilities *cap)
 {
 	switch (cap->hwcap_type) {
 	case CAP_HWCAP:
@@ -2426,7 +2482,7 @@ static bool cpus_have_elf_hwcap(const struct arm64_cpu_capabilities *cap)
 	return rc;
 }
 
-static void __init setup_elf_hwcaps(const struct arm64_cpu_capabilities *hwcaps)
+static void setup_elf_hwcaps(const struct arm64_cpu_capabilities *hwcaps)
 {
 	/* We support emulation of accesses to CPU ID feature registers */
 	cpu_set_named_feature(CPUID);
@@ -2601,7 +2657,7 @@ static void check_early_cpu_features(void)
 }
 
 static void
-verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
+__verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
 {
 
 	for (; caps->matches; caps++)
@@ -2612,6 +2668,14 @@ verify_local_elf_hwcaps(const struct arm64_cpu_capabilities *caps)
 		}
 }
 
+static void verify_local_elf_hwcaps(void)
+{
+	__verify_local_elf_hwcaps(arm64_elf_hwcaps);
+
+	if (id_aa64pfr0_32bit_el0(read_cpuid(ID_AA64PFR0_EL1)))
+		__verify_local_elf_hwcaps(compat_elf_hwcaps);
+}
+
 static void verify_sve_features(void)
 {
 	u64 safe_zcr = read_sanitised_ftr_reg(SYS_ZCR_EL1);
@@ -2676,11 +2740,7 @@ static void verify_local_cpu_capabilities(void)
 	 * on all secondary CPUs.
 	 */
 	verify_local_cpu_caps(SCOPE_ALL & ~SCOPE_BOOT_CPU);
-
-	verify_local_elf_hwcaps(arm64_elf_hwcaps);
-
-	if (system_supports_32bit_el0())
-		verify_local_elf_hwcaps(compat_elf_hwcaps);
+	verify_local_elf_hwcaps();
 
 	if (system_supports_sve())
 		verify_sve_features();
@@ -2815,6 +2875,34 @@ void __init setup_cpu_features(void)
 			ARCH_DMA_MINALIGN);
 }
 
+static int enable_mismatched_32bit_el0(unsigned int cpu)
+{
+	struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
+	bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
+
+	if (cpu_32bit) {
+		cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
+		static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
+		setup_elf_hwcaps(compat_elf_hwcaps);
+	}
+
+	return 0;
+}
+
+static int __init init_32bit_el0_mask(void)
+{
+	if (!allow_mismatched_32bit_el0)
+		return 0;
+
+	if (!zalloc_cpumask_var(&cpu_32bit_el0_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+				 "arm64/mismatched_32bit_el0:online",
+				 enable_mismatched_32bit_el0, NULL);
+}
+subsys_initcall_sync(init_32bit_el0_mask);
+
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
 {
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 21fbdda7086e..49305c2e6dfd 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -3,7 +3,8 @@
 # Internal CPU capabilities constants, keep this list sorted
 
 BTI
-HAS_32BIT_EL0
+# Unreliable: use system_supports_32bit_el0() instead.
+HAS_32BIT_EL0_DO_NOT_USE
 HAS_32BIT_EL1
 HAS_ADDRESS_AUTH
 HAS_ADDRESS_AUTH_ARCH
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 04/22] KVM: arm64: Kill 32-bit vCPUs on systems with mismatched EL0 support
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (2 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 03/22] arm64: Allow mismatched 32-bit EL0 support Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 05/22] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs Will Deacon
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

If a vCPU is caught running 32-bit code on a system with mismatched
support at EL0, then we should kill it.

Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kvm/arm.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..5bdba97a7654 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -692,6 +692,15 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
 	}
 }
 
+static bool vcpu_mode_is_bad_32bit(struct kvm_vcpu *vcpu)
+{
+	if (likely(!vcpu_mode_is_32bit(vcpu)))
+		return false;
+
+	return !system_supports_32bit_el0() ||
+		static_branch_unlikely(&arm64_mismatched_32bit_el0);
+}
+
 /**
  * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
  * @vcpu:	The VCPU pointer
@@ -875,7 +884,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * with the asymmetric AArch32 case), return to userspace with
 		 * a fatal error.
 		 */
-		if (!system_supports_32bit_el0() && vcpu_mode_is_32bit(vcpu)) {
+		if (vcpu_mode_is_bad_32bit(vcpu)) {
 			/*
 			 * As we have caught the guest red-handed, decide that
 			 * it isn't fit for purpose anymore by making the vcpu
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 05/22] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (3 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 04/22] KVM: arm64: Kill 32-bit vCPUs on systems with mismatched " Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 06/22] arm64: Advertise CPUs capable of running 32-bit applications in sysfs Will Deacon
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Scheduling a 32-bit application on a 64-bit-only CPU is a bad idea.

Ensure that 32-bit applications always take the slow-path when returning
to userspace on a system with mismatched support at EL0, so that we can
avoid trying to run on a 64-bit-only CPU and force a SIGKILL instead.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/process.c | 19 ++++++++++++++++++-
 arch/arm64/kernel/signal.c  | 26 ++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index b4bb67f17a2c..f4a91bf1ce0c 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -527,6 +527,15 @@ static void erratum_1418040_thread_switch(struct task_struct *prev,
 	write_sysreg(val, cntkctl_el1);
 }
 
+static void compat_thread_switch(struct task_struct *next)
+{
+	if (!is_compat_thread(task_thread_info(next)))
+		return;
+
+	if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+		set_tsk_thread_flag(next, TIF_NOTIFY_RESUME);
+}
+
 static void update_sctlr_el1(u64 sctlr)
 {
 	/*
@@ -568,6 +577,7 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
 	ssbs_thread_switch(next);
 	erratum_1418040_thread_switch(prev, next);
 	ptrauth_thread_switch_user(next);
+	compat_thread_switch(next);
 
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
@@ -633,8 +643,15 @@ unsigned long arch_align_stack(unsigned long sp)
  */
 void arch_setup_new_exec(void)
 {
-	current->mm->context.flags = is_compat_task() ? MMCF_AARCH32 : 0;
+	unsigned long mmflags = 0;
+
+	if (is_compat_task()) {
+		mmflags = MMCF_AARCH32;
+		if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	}
 
+	current->mm->context.flags = mmflags;
 	ptrauth_thread_init_user();
 	mte_thread_init_user();
 
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 6237486ff6bb..f8192f4ae0b8 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -911,6 +911,19 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+static bool cpu_affinity_invalid(struct pt_regs *regs)
+{
+	if (!compat_user_mode(regs))
+		return false;
+
+	/*
+	 * We're preemptible, but a reschedule will cause us to check the
+	 * affinity again.
+	 */
+	return !cpumask_test_cpu(raw_smp_processor_id(),
+				 system_32bit_el0_cpumask());
+}
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -938,6 +951,19 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 			if (thread_flags & _TIF_NOTIFY_RESUME) {
 				tracehook_notify_resume(regs);
 				rseq_handle_notify_resume(NULL, regs);
+
+				/*
+				 * If we reschedule after checking the affinity
+				 * then we must ensure that TIF_NOTIFY_RESUME
+				 * is set so that we check the affinity again.
+				 * Since tracehook_notify_resume() clears the
+				 * flag, ensure that the compiler doesn't move
+				 * it after the affinity check.
+				 */
+				barrier();
+
+				if (cpu_affinity_invalid(regs))
+					force_sig(SIGKILL);
 			}
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 06/22] arm64: Advertise CPUs capable of running 32-bit applications in sysfs
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (4 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 05/22] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 07/22] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection Will Deacon
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Since 32-bit applications will be killed if they are caught trying to
execute on a 64-bit-only CPU in a mismatched system, advertise the set
of 32-bit capable CPUs to userspace in sysfs.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 .../ABI/testing/sysfs-devices-system-cpu      |  9 +++++++++
 arch/arm64/kernel/cpufeature.c                | 19 +++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index fe13baa53c59..899377b2715a 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -494,6 +494,15 @@ Description:	AArch64 CPU registers
 		'identification' directory exposes the CPU ID registers for
 		identifying model and revision of the CPU.
 
+What:		/sys/devices/system/cpu/aarch32_el0
+Date:		May 2021
+Contact:	Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
+Description:	Identifies the subset of CPUs in the system that can execute
+		AArch32 (32-bit ARM) applications. If present, the same format as
+		/sys/devices/system/cpu/{offline,online,possible,present} is used.
+		If absent, then all or none of the CPUs can execute AArch32
+		applications and execve() will behave accordingly.
+
 What:		/sys/devices/system/cpu/cpu#/cpu_capacity
 Date:		December 2016
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 4194a47de62d..959442f76ed7 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -67,6 +67,7 @@
 #include <linux/crash_dump.h>
 #include <linux/sort.h>
 #include <linux/stop_machine.h>
+#include <linux/sysfs.h>
 #include <linux/types.h>
 #include <linux/minmax.h>
 #include <linux/mm.h>
@@ -1297,6 +1298,24 @@ const struct cpumask *system_32bit_el0_cpumask(void)
 	return cpu_possible_mask;
 }
 
+static ssize_t aarch32_el0_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	const struct cpumask *mask = system_32bit_el0_cpumask();
+
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(mask));
+}
+static const DEVICE_ATTR_RO(aarch32_el0);
+
+static int __init aarch32_el0_sysfs_init(void)
+{
+	if (!allow_mismatched_32bit_el0)
+		return 0;
+
+	return device_create_file(cpu_subsys.dev_root, &dev_attr_aarch32_el0);
+}
+device_initcall(aarch32_el0_sysfs_init);
+
 static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, int scope)
 {
 	if (!has_cpuid_feature(entry, scope))
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 07/22] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (5 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 06/22] arm64: Advertise CPUs capable of running 32-bit applications in sysfs Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 Will Deacon
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

On such a system, we must take care not to migrate a task to an
unsupported CPU when forcefully moving tasks in select_fallback_rq()
in response to a CPU hot-unplug operation.

Introduce a task_cpu_possible_mask() hook which, given a task argument,
allows an architecture to return a cpumask of CPUs that are capable of
executing that task. The default implementation returns the
cpu_possible_mask, since sane machines do not suffer from per-cpu ISA
limitations that affect scheduling. The new mask is used when selecting
the fallback runqueue as a last resort before forcing a migration to the
first active CPU.

Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/mmu_context.h | 11 +++++++++++
 kernel/sched/core.c         |  5 ++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 03dee12d2b61..1a599ba3524f 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -14,4 +14,15 @@
 static inline void leave_mm(int cpu) { }
 #endif
 
+/*
+ * CPUs that are capable of running task @p. By default, we assume a sane,
+ * homogeneous system. Must contain at least one active CPU.
+ */
+#ifndef task_cpu_possible_mask
+# define task_cpu_possible_mask(p)	cpu_possible_mask
+# define task_cpu_possible(cpu, p)	true
+#else
+# define task_cpu_possible(cpu, p)	cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1702a60d178d..00ed51528c70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1814,7 +1814,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 
 	/* Non kernel threads are not allowed during either online or offline. */
 	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu);
+		return cpu_active(cpu) && task_cpu_possible(cpu, p);
 
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@@ -2791,10 +2791,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 			 *
 			 * More yuck to audit.
 			 */
-			do_set_cpus_allowed(p, cpu_possible_mask);
+			do_set_cpus_allowed(p, task_cpu_possible_mask(p));
 			state = fail;
 			break;
-
 		case fail:
 			BUG();
 			break;
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (6 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 07/22] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-26 15:02   ` Peter Zijlstra
  2021-05-25 15:14 ` [PATCH v7 09/22] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() Will Deacon
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team, Li Zefan

If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.

In preparation for allowing architectures to provide their own fallback
mask, just return early if we're either using cgroup v1 or we're using
cgroup v2 with a mask that contains invalid CPUs. This will allow
select_fallback_rq() to figure out the mask by itself.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/cpuset.h |  1 +
 kernel/cgroup/cpuset.c | 12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 04c20de66afc..ed6ec677dd6b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -15,6 +15,7 @@
 #include <linux/cpumask.h>
 #include <linux/nodemask.h>
 #include <linux/mm.h>
+#include <linux/mmu_context.h>
 #include <linux/jump_label.h>
 
 #ifdef CONFIG_CPUSETS
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a945504c0ae7..8c799260a4a2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3322,9 +3322,17 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
+	const struct cpumask *cs_mask;
+	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
-		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
+	cs_mask = task_cs(tsk)->cpus_allowed;
+
+	if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
+		goto unlock; /* select_fallback_rq will try harder */
+
+	do_set_cpus_allowed(tsk, cs_mask);
+unlock:
 	rcu_read_unlock();
 
 	/*
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 09/22] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (7 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask() Will Deacon
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team, Li Zefan

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
account when trying to find a suitable set of online CPUs for a given
task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
during ->attach() and will subsequently allow the cpuset hierarchy to be
taken into account when forcefully overriding the affinity mask for a
task which requires migration to a compatible CPU.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/cpuset.h |  2 +-
 kernel/cgroup/cpuset.c | 41 ++++++++++++++++++++++++++---------------
 2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index ed6ec677dd6b..414a8e694413 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -185,7 +185,7 @@ static inline void cpuset_read_unlock(void) { }
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
-	cpumask_copy(mask, cpu_possible_mask);
+	cpumask_copy(mask, task_cpu_possible_mask(p));
 }
 
 static inline void cpuset_cpus_allowed_fallback(struct task_struct *p)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8c799260a4a2..e0649c970f48 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -372,18 +372,29 @@ static inline bool is_in_v2_mode(void)
 }
 
 /*
- * Return in pmask the portion of a cpusets's cpus_allowed that
- * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus.
+ * Return in pmask the portion of a task's cpusets's cpus_allowed that
+ * are online and are capable of running the task.  If none are found,
+ * walk up the cpuset hierarchy until we find one that does have some
+ * appropriate cpus.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
  *
  * Call with callback_lock or cpuset_mutex held.
  */
-static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
+static void guarantee_online_cpus(struct task_struct *tsk,
+				  struct cpumask *pmask)
 {
-	while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
+	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
+	struct cpuset *cs;
+
+	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_online_mask)))
+		cpumask_copy(pmask, cpu_online_mask);
+
+	rcu_read_lock();
+	cs = task_cs(tsk);
+
+	while (!cpumask_intersects(cs->effective_cpus, pmask)) {
 		cs = parent_cs(cs);
 		if (unlikely(!cs)) {
 			/*
@@ -393,11 +404,13 @@ static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
 			 * cpuset's effective_cpus is on its way to be
 			 * identical to cpu_online_mask.
 			 */
-			cpumask_copy(pmask, cpu_online_mask);
-			return;
+			goto out_unlock;
 		}
 	}
-	cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
+	cpumask_and(pmask, pmask, cs->effective_cpus);
+
+out_unlock:
+	rcu_read_unlock();
 }
 
 /*
@@ -2199,15 +2212,13 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	percpu_down_write(&cpuset_rwsem);
 
-	/* prepare for attach */
-	if (cs == &top_cpuset)
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	else
-		guarantee_online_cpus(cs, cpus_attach);
-
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, css, tset) {
+		if (cs != &top_cpuset)
+			guarantee_online_cpus(task, cpus_attach);
+		else
+			cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
 		/*
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail.  TODO: have a better way to handle failure here
@@ -3303,7 +3314,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 
 	spin_lock_irqsave(&callback_lock, flags);
 	rcu_read_lock();
-	guarantee_online_cpus(task_cs(tsk), pmask);
+	guarantee_online_cpus(tsk, pmask);
 	rcu_read_unlock();
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (8 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 09/22] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-26 15:15   ` Peter Zijlstra
  2021-05-25 15:14 ` [PATCH v7 11/22] sched: Introduce task_struct::user_cpus_ptr to track requested affinity Will Deacon
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Reject explicit requests to change the affinity mask of a task via
set_cpus_allowed_ptr() if the requested mask is not a subset of the
mask returned by task_cpu_possible_mask(). This ensures that the
'cpus_mask' for a given task cannot contain CPUs which are incapable of
executing it, except in cases where the affinity is forced.

Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 00ed51528c70..8ca7854747f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2346,6 +2346,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 				  u32 flags)
 {
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
+	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
 	unsigned int dest_cpu;
 	struct rq_flags rf;
 	struct rq *rq;
@@ -2366,6 +2367,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.
 		 */
 		cpu_valid_mask = cpu_online_mask;
+	} else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
+		ret = -EINVAL;
+		goto out;
 	}
 
 	/*
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 11/22] sched: Introduce task_struct::user_cpus_ptr to track requested affinity
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (9 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask() Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 12/22] sched: Split the guts of sched_setaffinity() into a helper function Will Deacon
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

In preparation for saving and restoring the user-requested CPU affinity
mask of a task, add a new cpumask_t pointer to 'struct task_struct'.

If the pointer is non-NULL, then the mask is copied across fork() and
freed on task exit.

Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/sched.h | 13 +++++++++++++
 init/init_task.c      |  1 +
 kernel/fork.c         |  2 ++
 kernel/sched/core.c   | 20 ++++++++++++++++++++
 4 files changed, 36 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c881384517..db32d4f7e5b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -730,6 +730,7 @@ struct task_struct {
 	unsigned int			policy;
 	int				nr_cpus_allowed;
 	const cpumask_t			*cpus_ptr;
+	cpumask_t			*user_cpus_ptr;
 	cpumask_t			cpus_mask;
 	void				*migration_pending;
 #ifdef CONFIG_SMP
@@ -1688,6 +1689,8 @@ extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
+extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
+extern void release_user_cpus_ptr(struct task_struct *p);
 #else
 static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
@@ -1698,6 +1701,16 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p, const struct cpuma
 		return -EINVAL;
 	return 0;
 }
+static inline int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node)
+{
+	if (src->user_cpus_ptr)
+		return -EINVAL;
+	return 0;
+}
+static inline void release_user_cpus_ptr(struct task_struct *p)
+{
+	WARN_ON(p->user_cpus_ptr);
+}
 #endif
 
 extern int yield_to(struct task_struct *p, bool preempt);
diff --git a/init/init_task.c b/init/init_task.c
index 8b08c2e19cbb..158c2b1689e1 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -80,6 +80,7 @@ struct task_struct init_task
 	.normal_prio	= MAX_PRIO - 20,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
+	.user_cpus_ptr	= NULL,
 	.cpus_mask	= CPU_MASK_ALL,
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..d3710e7f1686 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -446,6 +446,7 @@ void put_task_stack(struct task_struct *tsk)
 
 void free_task(struct task_struct *tsk)
 {
+	release_user_cpus_ptr(tsk);
 	scs_release(tsk);
 
 #ifndef CONFIG_THREAD_INFO_IN_TASK
@@ -918,6 +919,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #endif
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
+	dup_user_cpus_ptr(tsk, orig, node);
 
 	/*
 	 * One for the user space visible state that goes away when reaped.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ca7854747f1..9349b8ecbcf9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2124,6 +2124,26 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 	__do_set_cpus_allowed(p, new_mask, 0);
 }
 
+int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
+		      int node)
+{
+	if (!src->user_cpus_ptr)
+		return 0;
+
+	dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+	if (!dst->user_cpus_ptr)
+		return -ENOMEM;
+
+	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+	return 0;
+}
+
+void release_user_cpus_ptr(struct task_struct *p)
+{
+	kfree(p->user_cpus_ptr);
+	p->user_cpus_ptr = NULL;
+}
+
 /*
  * This function is wildly self concurrent; here be dragons.
  *
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 12/22] sched: Split the guts of sched_setaffinity() into a helper function
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (10 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 11/22] sched: Introduce task_struct::user_cpus_ptr to track requested affinity Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems Will Deacon
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

In preparation for replaying user affinity requests using a saved mask,
split sched_setaffinity() up so that the initial task lookup and
security checks are only performed when the request is coming directly
from userspace.

Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/sched/core.c | 105 ++++++++++++++++++++++++--------------------
 1 file changed, 57 insertions(+), 48 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9349b8ecbcf9..0b7faca947a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6784,53 +6784,22 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	return retval;
 }
 
-long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+static int
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 {
-	cpumask_var_t cpus_allowed, new_mask;
-	struct task_struct *p;
 	int retval;
+	cpumask_var_t cpus_allowed, new_mask;
 
-	rcu_read_lock();
-
-	p = find_process_by_pid(pid);
-	if (!p) {
-		rcu_read_unlock();
-		return -ESRCH;
-	}
-
-	/* Prevent p going away */
-	get_task_struct(p);
-	rcu_read_unlock();
+	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
+		return -ENOMEM;
 
-	if (p->flags & PF_NO_SETAFFINITY) {
-		retval = -EINVAL;
-		goto out_put_task;
-	}
-	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
-		retval = -ENOMEM;
-		goto out_put_task;
-	}
 	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
 		retval = -ENOMEM;
 		goto out_free_cpus_allowed;
 	}
-	retval = -EPERM;
-	if (!check_same_owner(p)) {
-		rcu_read_lock();
-		if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
-			rcu_read_unlock();
-			goto out_free_new_mask;
-		}
-		rcu_read_unlock();
-	}
-
-	retval = security_task_setscheduler(p);
-	if (retval)
-		goto out_free_new_mask;
-
 
 	cpuset_cpus_allowed(p, cpus_allowed);
-	cpumask_and(new_mask, in_mask, cpus_allowed);
+	cpumask_and(new_mask, mask, cpus_allowed);
 
 	/*
 	 * Since bandwidth control happens on root_domain basis,
@@ -6851,23 +6820,63 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 #endif
 again:
 	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+	if (retval)
+		goto out_free_new_mask;
 
-	if (!retval) {
-		cpuset_cpus_allowed(p, cpus_allowed);
-		if (!cpumask_subset(new_mask, cpus_allowed)) {
-			/*
-			 * We must have raced with a concurrent cpuset
-			 * update. Just reset the cpus_allowed to the
-			 * cpuset's cpus_allowed
-			 */
-			cpumask_copy(new_mask, cpus_allowed);
-			goto again;
-		}
+	cpuset_cpus_allowed(p, cpus_allowed);
+	if (!cpumask_subset(new_mask, cpus_allowed)) {
+		/*
+		 * We must have raced with a concurrent cpuset update.
+		 * Just reset the cpumask to the cpuset's cpus_allowed.
+		 */
+		cpumask_copy(new_mask, cpus_allowed);
+		goto again;
 	}
+
 out_free_new_mask:
 	free_cpumask_var(new_mask);
 out_free_cpus_allowed:
 	free_cpumask_var(cpus_allowed);
+	return retval;
+}
+
+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+{
+	struct task_struct *p;
+	int retval;
+
+	rcu_read_lock();
+
+	p = find_process_by_pid(pid);
+	if (!p) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	/* Prevent p going away */
+	get_task_struct(p);
+	rcu_read_unlock();
+
+	if (p->flags & PF_NO_SETAFFINITY) {
+		retval = -EINVAL;
+		goto out_put_task;
+	}
+
+	if (!check_same_owner(p)) {
+		rcu_read_lock();
+		if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+			rcu_read_unlock();
+			retval = -EPERM;
+			goto out_put_task;
+		}
+		rcu_read_unlock();
+	}
+
+	retval = security_task_setscheduler(p);
+	if (retval)
+		goto out_put_task;
+
+	retval = __sched_setaffinity(p, in_mask);
 out_put_task:
 	put_task_struct(p);
 	return retval;
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (11 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 12/22] sched: Split the guts of sched_setaffinity() into a helper function Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-26 16:20   ` Peter Zijlstra
  2021-05-26 16:30   ` Peter Zijlstra
  2021-05-25 15:14 ` [PATCH v7 14/22] sched: Introduce task_cpus_dl_admissible() to check proposed affinity Will Deacon
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask. Similarly, if a subsequent execve() reverts to
a compatible image, then the old affinity is restored if it is still
valid.

In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
{force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
and restore the affinity mask for a task based on the compatible CPUs.

Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/sched.h |   2 +
 kernel/sched/core.c   | 173 ++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h  |   1 +
 3 files changed, 160 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index db32d4f7e5b3..91a6cfeae242 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,6 +1691,8 @@ extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
 extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
 extern void release_user_cpus_ptr(struct task_struct *p);
+extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
+extern void relax_compatible_cpus_allowed_ptr(struct task_struct *p);
 #else
 static inline void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b7faca947a9..998ed1dbfd4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2353,26 +2353,21 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 }
 
 /*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
+ * Called with both p->pi_lock and rq->lock held; drops both before returning.
  */
-static int __set_cpus_allowed_ptr(struct task_struct *p,
-				  const struct cpumask *new_mask,
-				  u32 flags)
+static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
+					 const struct cpumask *new_mask,
+					 u32 flags,
+					 struct rq *rq,
+					 struct rq_flags *rf)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
 {
 	const struct cpumask *cpu_valid_mask = cpu_active_mask;
 	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
 	unsigned int dest_cpu;
-	struct rq_flags rf;
-	struct rq *rq;
 	int ret = 0;
 
-	rq = task_rq_lock(p, &rf);
 	update_rq_clock(rq);
 
 	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
@@ -2426,20 +2421,166 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	return affine_move_task(rq, p, &rf, dest_cpu, flags);
+	if (flags & SCA_USER)
+		release_user_cpus_ptr(p);
+
+	return affine_move_task(rq, p, rf, dest_cpu, flags);
 
 out:
-	task_rq_unlock(rq, p, &rf);
+	task_rq_unlock(rq, p, rf);
 
 	return ret;
 }
 
+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely. The
+ * call is not atomic; no spinlocks may be held.
+ */
+static int __set_cpus_allowed_ptr(struct task_struct *p,
+				  const struct cpumask *new_mask, u32 flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &rf);
+	return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
+}
+
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
 	return __set_cpus_allowed_ptr(p, new_mask, 0);
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
+/*
+ * Change a given task's CPU affinity to the intersection of its current
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask
+ * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * If the resulting mask is empty, leave the affinity unchanged and return
+ * -EINVAL.
+ */
+static int restrict_cpus_allowed_ptr(struct task_struct *p,
+				     struct cpumask *new_mask,
+				     const struct cpumask *subset_mask)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int err;
+	struct cpumask *user_mask = NULL;
+
+	if (!p->user_cpus_ptr)
+		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Forcefully restricting the affinity of a deadline task is
+	 * likely to cause problems, so fail and noisily override the
+	 * mask entirely.
+	 */
+	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
+		err = -EPERM;
+		goto err_unlock;
+	}
+
+	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+		err = -EINVAL;
+		goto err_unlock;
+	}
+
+	/*
+	 * We're about to butcher the task affinity, so keep track of what
+	 * the user asked for in case we're able to restore it later on.
+	 */
+	if (user_mask) {
+		cpumask_copy(user_mask, p->cpus_ptr);
+		p->user_cpus_ptr = user_mask;
+	}
+
+	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
+
+err_unlock:
+	task_rq_unlock(rq, p, &rf);
+	kfree(user_mask);
+	return err;
+}
+
+/*
+ * Restrict the CPU affinity of task @p so that it is a subset of
+ * task_cpu_possible_mask() and point @p->user_cpu_ptr to a copy of the
+ * old affinity mask. If the resulting mask is empty, we warn and walk
+ * up the cpuset hierarchy until we find a suitable mask.
+ */
+void force_compatible_cpus_allowed_ptr(struct task_struct *p)
+{
+	cpumask_var_t new_mask;
+	const struct cpumask *override_mask = task_cpu_possible_mask(p);
+
+	alloc_cpumask_var(&new_mask, GFP_KERNEL);
+
+	/*
+	 * __migrate_task() can fail silently in the face of concurrent
+	 * offlining of the chosen destination CPU, so take the hotplug
+	 * lock to ensure that the migration succeeds.
+	 */
+	cpus_read_lock();
+	if (!cpumask_available(new_mask))
+		goto out_set_mask;
+
+	if (!restrict_cpus_allowed_ptr(p, new_mask, override_mask))
+		goto out_free_mask;
+
+	/*
+	 * We failed to find a valid subset of the affinity mask for the
+	 * task, so override it based on its cpuset hierarchy.
+	 */
+	cpuset_cpus_allowed(p, new_mask);
+	override_mask = new_mask;
+
+out_set_mask:
+	if (printk_ratelimit()) {
+		printk_deferred("Overriding affinity for process %d (%s) to CPUs %*pbl\n",
+				task_pid_nr(p), p->comm,
+				cpumask_pr_args(override_mask));
+	}
+
+	WARN_ON(set_cpus_allowed_ptr(p, override_mask));
+out_free_mask:
+	cpus_read_unlock();
+	free_cpumask_var(new_mask);
+}
+
+static int
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+
+/*
+ * Restore the affinity of a task @p which was previously restricted by a
+ * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
+ * @p->user_cpus_ptr.
+ */
+void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
+{
+	unsigned long flags;
+	struct cpumask *mask = p->user_cpus_ptr;
+
+	/*
+	 * Try to restore the old affinity mask. If this fails, then
+	 * we free the mask explicitly to avoid it being inherited across
+	 * a subsequent fork().
+	 */
+	if (!mask || !__sched_setaffinity(p, mask))
+		return;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	release_user_cpus_ptr(p);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+}
+
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 #ifdef CONFIG_SCHED_DEBUG
@@ -6819,7 +6960,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	}
 #endif
 again:
-	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
 	if (retval)
 		goto out_free_new_mask;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..29c35b51411b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1956,6 +1956,7 @@ extern struct task_struct *pick_next_task_idle(struct rq *rq);
 #define SCA_CHECK		0x01
 #define SCA_MIGRATE_DISABLE	0x02
 #define SCA_MIGRATE_ENABLE	0x04
+#define SCA_USER		0x08
 
 #ifdef CONFIG_SMP
 
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 14/22] sched: Introduce task_cpus_dl_admissible() to check proposed affinity
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (12 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 15/22] freezer: Add frozen_or_skipped() helper function Will Deacon
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

In preparation for restricting the affinity of a task during execve()
on arm64, introduce a new task_cpus_dl_admissible() helper function to
give an indication as to whether the restricted mask is admissible for
a deadline task.

Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/sched.h |  6 ++++++
 kernel/sched/core.c   | 44 +++++++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 91a6cfeae242..9b17d8cfa6ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1691,6 +1691,7 @@ extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
 extern int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node);
 extern void release_user_cpus_ptr(struct task_struct *p);
+extern bool task_cpus_dl_admissible(struct task_struct *p, const struct cpumask *mask);
 extern void force_compatible_cpus_allowed_ptr(struct task_struct *p);
 extern void relax_compatible_cpus_allowed_ptr(struct task_struct *p);
 #else
@@ -1713,6 +1714,11 @@ static inline void release_user_cpus_ptr(struct task_struct *p)
 {
 	WARN_ON(p->user_cpus_ptr);
 }
+
+static inline bool task_cpus_dl_admissible(struct task_struct *p, const struct cpumask *mask)
+{
+	return true;
+}
 #endif
 
 extern int yield_to(struct task_struct *p, bool preempt);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 998ed1dbfd4f..42e2aecf087c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6925,6 +6925,31 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	return retval;
 }
 
+#ifdef CONFIG_SMP
+bool task_cpus_dl_admissible(struct task_struct *p, const struct cpumask *mask)
+{
+	bool ret;
+
+	/*
+	 * If the task isn't a deadline task or admission control is
+	 * disabled then we don't care about affinity changes.
+	 */
+	if (!task_has_dl_policy(p) || !dl_bandwidth_enabled())
+		return true;
+
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+	rcu_read_lock();
+	ret = cpumask_subset(task_rq(p)->rd->span, mask);
+	rcu_read_unlock();
+	return ret;
+}
+#endif
+
 static int
 __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 {
@@ -6942,23 +6967,10 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, mask, cpus_allowed);
 
-	/*
-	 * Since bandwidth control happens on root_domain basis,
-	 * if admission test is enabled, we only admit -deadline
-	 * tasks allowed to run on all the CPUs in the task's
-	 * root_domain.
-	 */
-#ifdef CONFIG_SMP
-	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
-		rcu_read_lock();
-		if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
-			retval = -EBUSY;
-			rcu_read_unlock();
-			goto out_free_new_mask;
-		}
-		rcu_read_unlock();
+	if (!task_cpus_dl_admissible(p, new_mask)) {
+		retval = -EBUSY;
+		goto out_free_new_mask;
 	}
-#endif
 again:
 	retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
 	if (retval)
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 15/22] freezer: Add frozen_or_skipped() helper function
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (13 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 14/22] sched: Introduce task_cpus_dl_admissible() to check proposed affinity Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks Will Deacon
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Occasionally it is necessary to see if a task is either frozen or
sleeping in the PF_FREEZER_SKIP state. In preparation for adding
additional users of this check, introduce a frozen_or_skipped() helper
function and convert the hung task detector over to using it.

Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/freezer.h | 6 ++++++
 kernel/hung_task.c      | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..b9e1e4200101 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -27,6 +27,11 @@ static inline bool frozen(struct task_struct *p)
 	return p->flags & PF_FROZEN;
 }
 
+static inline bool frozen_or_skipped(struct task_struct *p)
+{
+	return p->flags & (PF_FROZEN | PF_FREEZER_SKIP);
+}
+
 extern bool freezing_slow_path(struct task_struct *p);
 
 /*
@@ -270,6 +275,7 @@ static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
 
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
+static inline bool frozen_or_skipped(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
 static inline void __thaw_task(struct task_struct *t) {}
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..d2d4c4159b23 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(frozen_or_skipped(t)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (14 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 15/22] freezer: Add frozen_or_skipped() helper function Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-27 14:10   ` Peter Zijlstra
  2021-06-01  8:21   ` [RFC][PATCH] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  2021-05-25 15:14 ` [PATCH v7 17/22] arm64: Implement task_cpu_possible_mask() Will Deacon
                   ` (5 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.

Although we take care to prevent explicit hot-unplug of all 32-bit
capable CPUs on such a system, this is required when suspending on some
SoCs where the firmware mandates that the suspend/resume operation is
handled by CPU 0, which may not be capable of running 32-bit tasks.

Consequently, there is a window on the resume path where no 32-bit
capable CPUs are available for scheduling and waking up a 32-bit task
will result in a scheduler BUG() due to failure of select_fallback_rq():

  | kernel BUG at kernel/sched/core.c:2858!
  | Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
  | ...
  | Call trace:
  |  select_fallback_rq+0x4b0/0x4e4
  |  try_to_wake_up.llvm.4388853297126348405+0x460/0x5b0
  |  default_wake_function+0x1c/0x30
  |  autoremove_wake_function+0x1c/0x60
  |  __wake_up_common.llvm.11763074518265335900+0x100/0x1b8
  |  __wake_up+0x78/0xc4
  |  ep_poll_callback+0x20c/0x3fc

Prevent wakeups of unschedulable frozen tasks in ttwu() and instead
defer the wakeup to __thaw_tasks(), which runs only once all the
secondary CPUs are back online.

Signed-off-by: Will Deacon <will@kernel.org>
---
 kernel/freezer.c    | 10 +++++++++-
 kernel/sched/core.c | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..8f3d950c2a87 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -11,6 +11,7 @@
 #include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/kthread.h>
+#include <linux/mmu_context.h>
 
 /* total number of freezing conditions in effect */
 atomic_t system_freezing_cnt = ATOMIC_INIT(0);
@@ -146,9 +147,16 @@ bool freeze_task(struct task_struct *p)
 void __thaw_task(struct task_struct *p)
 {
 	unsigned long flags;
+	const struct cpumask *mask = task_cpu_possible_mask(p);
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
+	/*
+	 * Wake up frozen tasks. On asymmetric systems where tasks cannot
+	 * run on all CPUs, ttwu() may have deferred a wakeup generated
+	 * before thaw_secondary_cpus() had completed so we generate
+	 * additional wakeups here for tasks in the PF_FREEZER_SKIP state.
+	 */
+	if (frozen(p) || (frozen_or_skipped(p) && mask != cpu_possible_mask))
 		wake_up_process(p);
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42e2aecf087c..6cb9677d635a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3529,6 +3529,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (!(p->state & state))
 		goto unlock;
 
+#ifdef CONFIG_FREEZER
+	/*
+	 * If we're going to wake up a thread which may be frozen, then
+	 * we can only do so if we have an active CPU which is capable of
+	 * running it. This may not be the case when resuming from suspend,
+	 * as the secondary CPUs may not yet be back online. See __thaw_task()
+	 * for the actual wakeup.
+	 */
+	if (unlikely(frozen_or_skipped(p)) &&
+	    !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
+		goto unlock;
+#endif
+
 	trace_sched_waking(p);
 
 	/* We're going to change ->state: */
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 17/22] arm64: Implement task_cpu_possible_mask()
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (15 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 18/22] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0 Will Deacon
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Provide an implementation of task_cpu_possible_mask() so that we can
prevent 64-bit-only cores being added to the 'cpus_mask' for compat
tasks on systems with mismatched 32-bit support at EL0,

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/mmu_context.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index d3cef9133539..bb9b7510f334 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -231,6 +231,19 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	update_saved_ttbr0(tsk, next);
 }
 
+static inline const struct cpumask *
+task_cpu_possible_mask(struct task_struct *p)
+{
+	if (!static_branch_unlikely(&arm64_mismatched_32bit_el0))
+		return cpu_possible_mask;
+
+	if (!is_compat_thread(task_thread_info(p)))
+		return cpu_possible_mask;
+
+	return system_32bit_el0_cpumask();
+}
+#define task_cpu_possible_mask	task_cpu_possible_mask
+
 void verify_cpu_asid_bits(void);
 void post_ttbr_update_workaround(void);
 
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 18/22] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (16 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 17/22] arm64: Implement task_cpu_possible_mask() Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 19/22] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system Will Deacon
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

When exec'ing a 32-bit task on a system with mismatched support for
32-bit EL0, try to ensure that it starts life on a CPU that can actually
run it.

Similarly, when exec'ing a 64-bit task on such a system, try to restore
the old affinity mask if it was previously restricted.

Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/elf.h |  6 ++----
 arch/arm64/kernel/process.c  | 39 +++++++++++++++++++++++++++++++++++-
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 8d1c8dcb87fd..97932fbf973d 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -213,10 +213,8 @@ typedef compat_elf_greg_t		compat_elf_gregset_t[COMPAT_ELF_NGREG];
 
 /* AArch32 EABI. */
 #define EF_ARM_EABI_MASK		0xff000000
-#define compat_elf_check_arch(x)	(system_supports_32bit_el0() && \
-					 ((x)->e_machine == EM_ARM) && \
-					 ((x)->e_flags & EF_ARM_EABI_MASK))
-
+int compat_elf_check_arch(const struct elf32_hdr *);
+#define compat_elf_check_arch		compat_elf_check_arch
 #define compat_start_thread		compat_start_thread
 /*
  * Unlike the native SET_PERSONALITY macro, the compat version maintains
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index f4a91bf1ce0c..3aea06fdd1f9 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -22,6 +22,7 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/nospec.h>
+#include <linux/sched.h>
 #include <linux/stddef.h>
 #include <linux/sysctl.h>
 #include <linux/unistd.h>
@@ -638,6 +639,28 @@ unsigned long arch_align_stack(unsigned long sp)
 	return sp & ~0xf;
 }
 
+#ifdef CONFIG_COMPAT
+int compat_elf_check_arch(const struct elf32_hdr *hdr)
+{
+	if (!system_supports_32bit_el0())
+		return false;
+
+	if ((hdr)->e_machine != EM_ARM)
+		return false;
+
+	if (!((hdr)->e_flags & EF_ARM_EABI_MASK))
+		return false;
+
+	/*
+	 * Prevent execve() of a 32-bit program from a deadline task
+	 * if the restricted affinity mask would be inadmissible on an
+	 * asymmetric system.
+	 */
+	return !static_branch_unlikely(&arm64_mismatched_32bit_el0) ||
+	       task_cpus_dl_admissible(current, system_32bit_el0_cpumask());
+}
+#endif
+
 /*
  * Called from setup_new_exec() after (COMPAT_)SET_PERSONALITY.
  */
@@ -647,8 +670,22 @@ void arch_setup_new_exec(void)
 
 	if (is_compat_task()) {
 		mmflags = MMCF_AARCH32;
-		if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+
+		/*
+		 * Restrict the CPU affinity mask for a 32-bit task so that
+		 * it contains only 32-bit-capable CPUs.
+		 *
+		 * From the perspective of the task, this looks similar to
+		 * what would happen if the 64-bit-only CPUs were hot-unplugged
+		 * at the point of execve(), although we try a bit harder to
+		 * honour the cpuset hierarchy.
+		 */
+		if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+			force_compatible_cpus_allowed_ptr(current);
 			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		}
+	} else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+		relax_compatible_cpus_allowed_ptr(current);
 	}
 
 	current->mm->context.flags = mmflags;
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 19/22] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (17 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 18/22] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0 Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 20/22] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0 Will Deacon
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

If we want to support 32-bit applications, then when we identify a CPU
with mismatched 32-bit EL0 support we must ensure that we will always
have an active 32-bit CPU available to us from then on. This is important
for the scheduler, because is_cpu_allowed() will be constrained to 32-bit
CPUs for compat tasks and forced migration due to a hotplug event will
hang if no 32-bit CPUs are available.

On detecting a mismatch, prevent offlining of either the mismatching CPU
if it is 32-bit capable, or find the first active 32-bit capable CPU
otherwise.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/cpufeature.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 959442f76ed7..72efdc611b14 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2896,15 +2896,33 @@ void __init setup_cpu_features(void)
 
 static int enable_mismatched_32bit_el0(unsigned int cpu)
 {
+	static int lucky_winner = -1;
+
 	struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
 	bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
 
 	if (cpu_32bit) {
 		cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
 		static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
-		setup_elf_hwcaps(compat_elf_hwcaps);
 	}
 
+	if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
+		return 0;
+
+	if (lucky_winner >= 0)
+		return 0;
+
+	/*
+	 * We've detected a mismatch. We need to keep one of our CPUs with
+	 * 32-bit EL0 online so that is_cpu_allowed() doesn't end up rejecting
+	 * every CPU in the system for a 32-bit task.
+	 */
+	lucky_winner = cpu_32bit ? cpu : cpumask_any_and(cpu_32bit_el0_mask,
+							 cpu_active_mask);
+	get_cpu_device(lucky_winner)->offline_disabled = true;
+	setup_elf_hwcaps(compat_elf_hwcaps);
+	pr_info("Asymmetric 32-bit EL0 support detected on CPU %u; CPU hot-unplug disabled on CPU %u\n",
+		cpu, lucky_winner);
 	return 0;
 }
 
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 20/22] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (18 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 19/22] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 21/22] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores Will Deacon
  2021-05-25 15:14 ` [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support Will Deacon
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Allow systems with mismatched 32-bit support at EL0 to run 32-bit
applications based on a new kernel parameter.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++
 arch/arm64/kernel/cpufeature.c                  | 7 +++++++
 2 files changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cb89dbdedc46..a2e453919bb6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -287,6 +287,14 @@
 			do not want to use tracing_snapshot_alloc() as it needs
 			to be done where GFP_KERNEL allocations are allowed.
 
+	allow_mismatched_32bit_el0 [ARM64]
+			Allow execve() of 32-bit applications and setting of the
+			PER_LINUX32 personality on systems where only a strict
+			subset of the CPUs support 32-bit EL0. When this
+			parameter is present, the set of CPUs supporting 32-bit
+			EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
+			and hot-unplug operations may be restricted.
+
 	amd_iommu=	[HW,X86-64]
 			Pass parameters to the AMD IOMMU driver in the system.
 			Possible values are:
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 72efdc611b14..f2c97baa050f 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1298,6 +1298,13 @@ const struct cpumask *system_32bit_el0_cpumask(void)
 	return cpu_possible_mask;
 }
 
+static int __init parse_32bit_el0_param(char *str)
+{
+	allow_mismatched_32bit_el0 = true;
+	return 0;
+}
+early_param("allow_mismatched_32bit_el0", parse_32bit_el0_param);
+
 static ssize_t aarch32_el0_show(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 21/22] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (19 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 20/22] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0 Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 15:14 ` [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support Will Deacon
  21 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

The scheduler now knows enough about these braindead systems to place
32-bit tasks accordingly, so throw out the safety checks and allow the
ret-to-user path to avoid do_notify_resume() if there is nothing to do.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/process.c | 14 +-------------
 arch/arm64/kernel/signal.c  | 26 --------------------------
 2 files changed, 1 insertion(+), 39 deletions(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 3aea06fdd1f9..5581c376644e 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -528,15 +528,6 @@ static void erratum_1418040_thread_switch(struct task_struct *prev,
 	write_sysreg(val, cntkctl_el1);
 }
 
-static void compat_thread_switch(struct task_struct *next)
-{
-	if (!is_compat_thread(task_thread_info(next)))
-		return;
-
-	if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
-		set_tsk_thread_flag(next, TIF_NOTIFY_RESUME);
-}
-
 static void update_sctlr_el1(u64 sctlr)
 {
 	/*
@@ -578,7 +569,6 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
 	ssbs_thread_switch(next);
 	erratum_1418040_thread_switch(prev, next);
 	ptrauth_thread_switch_user(next);
-	compat_thread_switch(next);
 
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
@@ -680,10 +670,8 @@ void arch_setup_new_exec(void)
 		 * at the point of execve(), although we try a bit harder to
 		 * honour the cpuset hierarchy.
 		 */
-		if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
+		if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
 			force_compatible_cpus_allowed_ptr(current);
-			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
-		}
 	} else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
 		relax_compatible_cpus_allowed_ptr(current);
 	}
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f8192f4ae0b8..6237486ff6bb 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -911,19 +911,6 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
-static bool cpu_affinity_invalid(struct pt_regs *regs)
-{
-	if (!compat_user_mode(regs))
-		return false;
-
-	/*
-	 * We're preemptible, but a reschedule will cause us to check the
-	 * affinity again.
-	 */
-	return !cpumask_test_cpu(raw_smp_processor_id(),
-				 system_32bit_el0_cpumask());
-}
-
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -951,19 +938,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 			if (thread_flags & _TIF_NOTIFY_RESUME) {
 				tracehook_notify_resume(regs);
 				rseq_handle_notify_resume(NULL, regs);
-
-				/*
-				 * If we reschedule after checking the affinity
-				 * then we must ensure that TIF_NOTIFY_RESUME
-				 * is set so that we check the affinity again.
-				 * Since tracehook_notify_resume() clears the
-				 * flag, ensure that the compiler doesn't move
-				 * it after the affinity check.
-				 */
-				barrier();
-
-				if (cpu_affinity_invalid(regs))
-					force_sig(SIGKILL);
 			}
 
 			if (thread_flags & _TIF_FOREIGN_FPSTATE)
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support
  2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
                   ` (20 preceding siblings ...)
  2021-05-25 15:14 ` [PATCH v7 21/22] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores Will Deacon
@ 2021-05-25 15:14 ` Will Deacon
  2021-05-25 17:13   ` Marc Zyngier
  21 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-25 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Document support for running 32-bit tasks on asymmetric 32-bit systems
and its impact on the user ABI when enabled.

Signed-off-by: Will Deacon <will@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         |   3 +
 Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++++++++++++
 Documentation/arm64/index.rst                 |   1 +
 3 files changed, 158 insertions(+)
 create mode 100644 Documentation/arm64/asymmetric-32bit.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a2e453919bb6..5a1dc7e628a5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -295,6 +295,9 @@
 			EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
 			and hot-unplug operations may be restricted.
 
+			See Documentation/arm64/asymmetric-32bit.rst for more
+			information.
+
 	amd_iommu=	[HW,X86-64]
 			Pass parameters to the AMD IOMMU driver in the system.
 			Possible values are:
diff --git a/Documentation/arm64/asymmetric-32bit.rst b/Documentation/arm64/asymmetric-32bit.rst
new file mode 100644
index 000000000000..a70a2b97e60b
--- /dev/null
+++ b/Documentation/arm64/asymmetric-32bit.rst
@@ -0,0 +1,154 @@
+======================
+Asymmetric 32-bit SoCs
+======================
+
+Author: Will Deacon <will@kernel.org>
+
+This document describes the impact of asymmetric 32-bit SoCs on the
+execution of 32-bit (``AArch32``) applications.
+
+Date: 2021-05-17
+
+Introduction
+============
+
+Some Armv9 SoCs suffer from a big.LITTLE misfeature where only a subset
+of the CPUs are capable of executing 32-bit user applications. On such
+a system, Linux by default treats the asymmetry as a "mismatch" and
+disables support for both the ``PER_LINUX32`` personality and
+``execve(2)`` of 32-bit ELF binaries, with the latter returning
+``-ENOEXEC``. If the mismatch is detected during late onlining of a
+64-bit-only CPU, then the onlining operation fails and the new CPU is
+unavailable for scheduling.
+
+Surprisingly, these SoCs have been produced with the intention of
+running legacy 32-bit binaries. Unsurprisingly, that doesn't work very
+well with the default behaviour of Linux.
+
+It seems inevitable that future SoCs will drop 32-bit support
+altogether, so if you're stuck in the unenviable position of needing to
+run 32-bit code on one of these transitionary platforms then you would
+be wise to consider alternatives such as recompilation, emulation or
+retirement. If neither of those options are practical, then read on.
+
+Enabling kernel support
+=======================
+
+Since the kernel support is not completely transparent to userspace,
+allowing 32-bit tasks to run on an asymmetric 32-bit system requires an
+explicit "opt-in" and can be enabled by passing the
+``allow_mismatched_32bit_el0`` parameter on the kernel command-line.
+
+For the remainder of this document we will refer to an *asymmetric
+system* to mean an asymmetric 32-bit SoC running Linux with this kernel
+command-line option enabled.
+
+Userspace impact
+================
+
+32-bit tasks running on an asymmetric system behave in mostly the same
+way as on a homogeneous system, with a few key differences relating to
+CPU affinity.
+
+sysfs
+-----
+
+The subset of CPUs capable of running 32-bit tasks is described in
+``/sys/devices/system/cpu/aarch32_el0`` and is documented further in
+``Documentation/ABI/testing/sysfs-devices-system-cpu``.
+
+**Note:** CPUs are advertised by this file as they are detected and so
+late-onlining of 32-bit-capable CPUs can result in the file contents
+being modified by the kernel at runtime. Once advertised, CPUs are never
+removed from the file.
+
+``execve(2)``
+-------------
+
+On a homogeneous system, the CPU affinity of a task is preserved across
+``execve(2)``. This is not always possible on an asymmetric system,
+specifically when the new program being executed is 32-bit yet the
+affinity mask contains 64-bit-only CPUs. In this situation, the kernel
+determines the new affinity mask as follows:
+
+  1. If the 32-bit-capable subset of the affinity mask is not empty,
+     then the affinity is restricted to that subset and the old affinity
+     mask is saved. This saved mask is inherited over ``fork(2)`` and
+     preserved across ``execve(2)`` of 32-bit programs.
+
+     **Note:** This step does not apply to ``SCHED_DEADLINE`` tasks.
+     See `SCHED_DEADLINE`_.
+
+  2. Otherwise, the cpuset hierarchy of the task is walked until an
+     ancestor is found containing at least one 32-bit-capable CPU. The
+     affinity of the task is then changed to match the 32-bit-capable
+     subset of the cpuset determined by the walk.
+
+  3. On failure (i.e. out of memory), the affinity is changed to the set
+     of all 32-bit-capable CPUs of which the kernel is aware.
+
+A subsequent ``execve(2)`` of a 64-bit program by the 32-bit task will
+invalidate the affinity mask saved in (1) and attempt to restore the CPU
+affinity of the task using the saved mask if it was previously valid.
+This restoration may fail due to intervening changes to the deadline
+policy or cpuset hierarchy, in which case the ``execve(2)`` continues
+with the affinity unchanged.
+
+Calls to ``sched_setaffinity(2)`` for a 32-bit task will consider only
+the 32-bit-capable CPUs of the requested affinity mask. On success, the
+affinity for the task is updated and any saved mask from a prior
+``execve(2)`` is invalidated.
+
+``SCHED_DEADLINE``
+------------------
+
+Explicit admission of a 32-bit deadline task to the default root domain
+(e.g. by calling ``sched_setattr(2)``) is rejected on an asymmetric
+32-bit system unless admission control is disabled by writing -1 to
+``/proc/sys/kernel/sched_rt_runtime_us``.
+
+``execve(2)`` of a 32-bit program from a 64-bit deadline task will
+return ``-ENOEXEC`` if the root domain for the task contains any
+64-bit-only CPUs and admission control is enabled. Concurrent offlining
+of 32-bit-capable CPUs may still necessitate the procedure described in
+`execve(2)`_, in which case step (1) is skipped and a warning is
+emitted on the console.
+
+**Note:** It is recommended that a set of 32-bit-capable CPUs are placed
+into a separate root domain if ``SCHED_DEADLINE`` is to be used with
+32-bit tasks on an asymmetric system. Failure to do so is likely to
+result in missed deadlines.
+
+Cpusets
+-------
+
+The affinity of a 32-bit task on an asymmetric system may include CPUs
+that are not explicitly allowed by the cpuset to which it is attached.
+This can occur as a result of the following two situations:
+
+  - A 64-bit task attached to a cpuset which allows only 64-bit CPUs
+    executes a 32-bit program.
+
+  - All of the 32-bit-capable CPUs allowed by a cpuset containing a
+    32-bit task are offlined.
+
+In both of these cases, the new affinity is calculated according to step
+(2) of the process described in `execve(2)`_ and the cpuset hierarchy is
+unchanged irrespective of the cgroup version.
+
+CPU hotplug
+-----------
+
+On an asymmetric system, the first detected 32-bit-capable CPU is
+prevented from being offlined by userspace and any such attempt will
+return ``-EPERM``. Note that suspend is still permitted even if the
+primary CPU (i.e. CPU 0) is 64-bit-only.
+
+KVM
+---
+
+Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
+asymmetric system, a broken guest at EL1 could still attempt to execute
+32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
+mode will return to host userspace with an ``exit_reason`` of
+``KVM_EXIT_FAIL_ENTRY``.
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 97d65ba12a35..4f840bac083e 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -10,6 +10,7 @@ ARM64 Architecture
     acpi_object_usage
     amu
     arm-acpi
+    asymmetric-32bit
     booting
     cpu-feature-registers
     elf_hwcaps
-- 
2.31.1.818.g46aad6cb9e-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support
  2021-05-25 15:14 ` [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support Will Deacon
@ 2021-05-25 17:13   ` Marc Zyngier
  2021-05-25 17:27     ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Marc Zyngier @ 2021-05-25 17:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Greg Kroah-Hartman, Peter Zijlstra, Morten Rasmussen,
	Qais Yousef, Suren Baghdasaryan, Quentin Perret, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Rafael J. Wysocki, Dietmar Eggemann, Daniel Bristot de Oliveira,
	kernel-team

On Tue, 25 May 2021 16:14:32 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> Document support for running 32-bit tasks on asymmetric 32-bit systems
> and its impact on the user ABI when enabled.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 +
>  Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++++++++++++
>  Documentation/arm64/index.rst                 |   1 +
>  3 files changed, 158 insertions(+)
>  create mode 100644 Documentation/arm64/asymmetric-32bit.rst
>

[...]

> +KVM
> +---
> +
> +Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
> +asymmetric system, a broken guest at EL1 could still attempt to execute
> +32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
> +mode will return to host userspace with an ``exit_reason`` of
> +``KVM_EXIT_FAIL_ENTRY``.

Nit: there is a bit more to it. The vcpu will be left in a permanent
non-runnable state until KVM_ARM_VCPU_INIT is issued to reset the vcpu
into a saner state.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support
  2021-05-25 17:13   ` Marc Zyngier
@ 2021-05-25 17:27     ` Will Deacon
  2021-05-25 18:11       ` Marc Zyngier
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-25 17:27 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Greg Kroah-Hartman, Peter Zijlstra, Morten Rasmussen,
	Qais Yousef, Suren Baghdasaryan, Quentin Perret, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Rafael J. Wysocki, Dietmar Eggemann, Daniel Bristot de Oliveira,
	kernel-team

On Tue, May 25, 2021 at 06:13:58PM +0100, Marc Zyngier wrote:
> On Tue, 25 May 2021 16:14:32 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > Document support for running 32-bit tasks on asymmetric 32-bit systems
> > and its impact on the user ABI when enabled.
> > 
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   3 +
> >  Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++++++++++++
> >  Documentation/arm64/index.rst                 |   1 +
> >  3 files changed, 158 insertions(+)
> >  create mode 100644 Documentation/arm64/asymmetric-32bit.rst
> >
> 
> [...]
> 
> > +KVM
> > +---
> > +
> > +Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
> > +asymmetric system, a broken guest at EL1 could still attempt to execute
> > +32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
> > +mode will return to host userspace with an ``exit_reason`` of
> > +``KVM_EXIT_FAIL_ENTRY``.
> 
> Nit: there is a bit more to it. The vcpu will be left in a permanent
> non-runnable state until KVM_ARM_VCPU_INIT is issued to reset the vcpu
> into a saner state.

Thanks, I'll add "and will remain non-runnable until re-initialised by a
subsequent KVM_ARM_VCPU_INIT operation".

Can the VMM tell that it needs to do that? I wonder if we should be
setting 'hardware_entry_failure_reason' to distinguish this case.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support
  2021-05-25 17:27     ` Will Deacon
@ 2021-05-25 18:11       ` Marc Zyngier
  2021-05-26 16:00         ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Marc Zyngier @ 2021-05-25 18:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Greg Kroah-Hartman, Peter Zijlstra, Morten Rasmussen,
	Qais Yousef, Suren Baghdasaryan, Quentin Perret, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Rafael J. Wysocki, Dietmar Eggemann, Daniel Bristot de Oliveira,
	kernel-team

On Tue, 25 May 2021 18:27:03 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Tue, May 25, 2021 at 06:13:58PM +0100, Marc Zyngier wrote:
> > On Tue, 25 May 2021 16:14:32 +0100,
> > Will Deacon <will@kernel.org> wrote:
> > > 
> > > Document support for running 32-bit tasks on asymmetric 32-bit systems
> > > and its impact on the user ABI when enabled.
> > > 
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  .../admin-guide/kernel-parameters.txt         |   3 +
> > >  Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++++++++++++
> > >  Documentation/arm64/index.rst                 |   1 +
> > >  3 files changed, 158 insertions(+)
> > >  create mode 100644 Documentation/arm64/asymmetric-32bit.rst
> > >
> > 
> > [...]
> > 
> > > +KVM
> > > +---
> > > +
> > > +Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
> > > +asymmetric system, a broken guest at EL1 could still attempt to execute
> > > +32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
> > > +mode will return to host userspace with an ``exit_reason`` of
> > > +``KVM_EXIT_FAIL_ENTRY``.
> > 
> > Nit: there is a bit more to it. The vcpu will be left in a permanent
> > non-runnable state until KVM_ARM_VCPU_INIT is issued to reset the vcpu
> > into a saner state.
> 
> Thanks, I'll add "and will remain non-runnable until re-initialised by a
> subsequent KVM_ARM_VCPU_INIT operation".

Looks good.

> Can the VMM tell that it needs to do that? I wonder if we should be
> setting 'hardware_entry_failure_reason' to distinguish this case.

The VMM should be able to notice that something is amiss, as any
subsequent KVM_RUN calls will result in -ENOEXEC being returned, and
we document this as "the vcpu hasn't been initialized or the guest
tried to execute instructions from device memory (arm64)".

However, there is another reason to get a "FAILED_ENTRY", and that if
we get an Illegal Exception Return exception when entering the
guest. That one should always be a KVM bug.

So yeah, maybe there is some ground to populate that structure with
the appropriate nastygram (completely untested).

	M.

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..cf50051a9412 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -402,6 +402,10 @@ struct kvm_vcpu_events {
 #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
 #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
 
+/* KVM_EXIT_FAIL_ENTRY reasons */
+#define KVM_ARM64_FAILED_ENTRY_NO_AARCH32_ALLOWED	0xBADBAD32
+#define KVM_ARM64_FAILED_ENTRY_INTERNAL_ERROR		0xE1215BAD
+
 #endif
 
 #endif /* __ARM_KVM_H__ */
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 6f48336b1d86..e97cd4de1fa7 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -262,6 +262,10 @@ int handle_exit(struct kvm_vcpu *vcpu, int exception_index)
 		 * have been corrupted somehow.  Give up.
 		 */
 		run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+		run->fail_entry.hardware_entry_failure_reason = (vcpu->arch.target == -1) ?
+			KVM_ARM64_FAILED_ENTRY_NO_AARCH32_ALLOWED :
+			KVM_ARM64_FAILED_ENTRY_INTERNAL_ERROR;
+		run->fail_entry.cpu = vcpu->cpu;
 		return -EINVAL;
 	default:
 		kvm_pr_unimpl("Unsupported exception type: %d",

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-25 15:14 ` [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination Will Deacon
@ 2021-05-26 11:14   ` Valentin Schneider
  2021-05-26 12:32     ` Peter Zijlstra
  2021-05-26 16:03     ` Will Deacon
  0 siblings, 2 replies; 57+ messages in thread
From: Valentin Schneider @ 2021-05-26 11:14 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel
  Cc: linux-arch, linux-kernel, Will Deacon, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

Hi,

On 25/05/21 16:14, Will Deacon wrote:
> Since commit 6d337eab041d ("sched: Fix migrate_disable() vs
> set_cpus_allowed_ptr()"), the migration stopper thread is left to
> determine the destination CPU of the running task being migrated, even
> though set_cpus_allowed_ptr() already identified a candidate target
> earlier on.
>
> Unfortunately, the stopper doesn't check whether or not the new
> destination CPU is active or not, so __migrate_task() can leave the task
> sitting on a CPU that is outside of its affinity mask, even if the CPU
> originally chosen by SCA is still active.
>
> For example, with CONFIG_CPUSET=n:
>
>  $ taskset -pc 0-2 $PID
>  # offline CPUs 3-4
>  $ taskset -pc 3-5 $PID
>
> Then $PID remains on its current CPU (one of 0-2) and does not get
> migrated to CPU 5.
>
> Rework 'struct migration_arg' so that an optional pointer to an affinity
> mask can be provided to the stopper, allowing us to respect the
> original choice of destination CPU when migrating. Note that there is
> still the potential to race with a concurrent CPU hot-unplug of the
> destination CPU if the caller does not hold the hotplug lock.
>
> Reported-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  kernel/sched/core.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..1702a60d178d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1869,6 +1869,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
>  struct migration_arg {
>       struct task_struct		*task;
>       int				dest_cpu;
> +	const struct cpumask		*dest_mask;
>       struct set_affinity_pending	*pending;
>  };
>
> @@ -1917,6 +1918,7 @@ static int migration_cpu_stop(void *data)
>       struct set_affinity_pending *pending = arg->pending;
>       struct task_struct *p = arg->task;
>       int dest_cpu = arg->dest_cpu;
> +	const struct cpumask *dest_mask = arg->dest_mask;
>       struct rq *rq = this_rq();
>       bool complete = false;
>       struct rq_flags rf;
> @@ -1956,12 +1958,8 @@ static int migration_cpu_stop(void *data)
>                       complete = true;
>               }
>
> -		if (dest_cpu < 0) {
> -			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
> -				goto out;
> -
> -			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
> -		}
> +		if (dest_mask && (cpumask_test_cpu(task_cpu(p), dest_mask)))
> +			goto out;
>

IIRC the reason we deferred the pick to migration_cpu_stop() was because of
those insane races involving multiple SCA calls the likes of:

  p->cpus_mask = [0, 1]; p on CPU0

  CPUx                           CPUy                   CPU0

  SCA(p, [2])
    __do_set_cpus_allowed();
    queue migration_cpu_stop()
                                 SCA(p, [3])
                                   __do_set_cpus_allowed();
                                                        migration_cpu_stop()

The stopper needs to use the latest cpumask set by the second SCA despite
having an arg->pending set up by the first SCA. Doesn't this break here?

I'm not sure I've paged back in all of the subtleties laying in ambush
here, but what about the below?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..cd447c9db61d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1916,7 +1916,6 @@ static int migration_cpu_stop(void *data)
 	struct migration_arg *arg = data;
 	struct set_affinity_pending *pending = arg->pending;
 	struct task_struct *p = arg->task;
-	int dest_cpu = arg->dest_cpu;
 	struct rq *rq = this_rq();
 	bool complete = false;
 	struct rq_flags rf;
@@ -1954,19 +1953,15 @@ static int migration_cpu_stop(void *data)
 		if (pending) {
 			p->migration_pending = NULL;
 			complete = true;
-		}
 
-		if (dest_cpu < 0) {
 			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
 				goto out;
-
-			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
 		}
 
 		if (task_on_rq_queued(p))
-			rq = __migrate_task(rq, &rf, p, dest_cpu);
+			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
 		else
-			p->wake_cpu = dest_cpu;
+			p->wake_cpu = arg->dest_cpu;
 
 		/*
 		 * XXX __migrate_task() can fail, at which point we might end
@@ -2249,7 +2244,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 			init_completion(&my_pending.done);
 			my_pending.arg = (struct migration_arg) {
 				.task = p,
-				.dest_cpu = -1,		/* any */
+				.dest_cpu = dest_cpu,
 				.pending = &my_pending,
 			};
 
@@ -2257,6 +2252,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		} else {
 			pending = p->migration_pending;
 			refcount_inc(&pending->refs);
+			pending->arg.dest_cpu = dest_cpu;
 		}
 	}
 	pending = p->migration_pending;

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-26 11:14   ` Valentin Schneider
@ 2021-05-26 12:32     ` Peter Zijlstra
  2021-05-26 12:36       ` Valentin Schneider
  2021-05-26 16:03     ` Will Deacon
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 12:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Will Deacon, linux-arm-kernel, linux-arch, linux-kernel,
	Catalin Marinas, Marc Zyngier, Greg Kroah-Hartman,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 12:14:20PM +0100, Valentin Schneider wrote:
> On 25/05/21 16:14, Will Deacon wrote:

> > @@ -1956,12 +1958,8 @@ static int migration_cpu_stop(void *data)
> >                       complete = true;
> >               }
> >
> > -		if (dest_cpu < 0) {
> > -			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
> > -				goto out;
> > -
> > -			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
> > -		}
> > +		if (dest_mask && (cpumask_test_cpu(task_cpu(p), dest_mask)))
> > +			goto out;
> >
> 
> IIRC the reason we deferred the pick to migration_cpu_stop() was because of
> those insane races involving multiple SCA calls the likes of:
> 
>   p->cpus_mask = [0, 1]; p on CPU0
> 
>   CPUx                           CPUy                   CPU0
> 
>   SCA(p, [2])
>     __do_set_cpus_allowed();
>     queue migration_cpu_stop()
>                                  SCA(p, [3])
>                                    __do_set_cpus_allowed();
>                                                         migration_cpu_stop()
> 
> The stopper needs to use the latest cpumask set by the second SCA despite
> having an arg->pending set up by the first SCA. Doesn't this break here?

Yep.

> I'm not sure I've paged back in all of the subtleties laying in ambush
> here, but what about the below?
> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..cd447c9db61d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c

> @@ -1954,19 +1953,15 @@ static int migration_cpu_stop(void *data)
>  		if (pending) {
>  			p->migration_pending = NULL;
>  			complete = true;
>  
>  			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
>  				goto out;
>  		}
>  
>  		if (task_on_rq_queued(p))
> +			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);

> @@ -2249,7 +2244,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  			init_completion(&my_pending.done);
>  			my_pending.arg = (struct migration_arg) {
>  				.task = p,
> +				.dest_cpu = dest_cpu,
>  				.pending = &my_pending,
>  			};
>  
> @@ -2257,6 +2252,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  		} else {
>  			pending = p->migration_pending;
>  			refcount_inc(&pending->refs);
> +			pending->arg.dest_cpu = dest_cpu;
>  		}
>  	}

Argh.. that might just work. But I'm thinking we wants comments this
time around :-) This is even more subtle.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-26 12:32     ` Peter Zijlstra
@ 2021-05-26 12:36       ` Valentin Schneider
  0 siblings, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2021-05-26 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-arm-kernel, linux-arch, linux-kernel,
	Catalin Marinas, Marc Zyngier, Greg Kroah-Hartman,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

On 26/05/21 14:32, Peter Zijlstra wrote:
> On Wed, May 26, 2021 at 12:14:20PM +0100, Valentin Schneider wrote:
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5226cc26a095..cd447c9db61d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>
>> @@ -1954,19 +1953,15 @@ static int migration_cpu_stop(void *data)
>>  		if (pending) {
>>  			p->migration_pending = NULL;
>>  			complete = true;
>>  
>>  			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
>>  				goto out;
>>  		}
>>  
>>  		if (task_on_rq_queued(p))
>> +			rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
>
>> @@ -2249,7 +2244,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>>  			init_completion(&my_pending.done);
>>  			my_pending.arg = (struct migration_arg) {
>>  				.task = p,
>> +				.dest_cpu = dest_cpu,
>>  				.pending = &my_pending,
>>  			};
>>  
>> @@ -2257,6 +2252,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>>  		} else {
>>  			pending = p->migration_pending;
>>  			refcount_inc(&pending->refs);
>> +			pending->arg.dest_cpu = dest_cpu;
>>  		}
>>  	}
>
> Argh.. that might just work. But I'm thinking we wants comments this
> time around :-) This is even more subtle.

Lemme stare at it some more and sharpen my quill then.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  2021-05-25 15:14 ` [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 Will Deacon
@ 2021-05-26 15:02   ` Peter Zijlstra
  2021-05-26 16:07     ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 15:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Li Zefan

On Tue, May 25, 2021 at 04:14:18PM +0100, Will Deacon wrote:
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
> +	const struct cpumask *cs_mask;
> +	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
>  	rcu_read_lock();
> +	cs_mask = task_cs(tsk)->cpus_allowed;
> +
> +	if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
> +		goto unlock; /* select_fallback_rq will try harder */
> +
> +	do_set_cpus_allowed(tsk, cs_mask);
> +unlock:

	if (is_in_v2_mode() && cpumask_subset(cs_mask, possible_mask))
		do_set_cpus_allowed(tsk, cs_mask);

perhaps?


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  2021-05-25 15:14 ` [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask() Will Deacon
@ 2021-05-26 15:15   ` Peter Zijlstra
  2021-05-26 16:12     ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 15:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Tue, May 25, 2021 at 04:14:20PM +0100, Will Deacon wrote:
> Reject explicit requests to change the affinity mask of a task via
> set_cpus_allowed_ptr() if the requested mask is not a subset of the
> mask returned by task_cpu_possible_mask(). This ensures that the
> 'cpus_mask' for a given task cannot contain CPUs which are incapable of
> executing it, except in cases where the affinity is forced.
> 
> Reviewed-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  kernel/sched/core.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 00ed51528c70..8ca7854747f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2346,6 +2346,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>  				  u32 flags)
>  {
>  	const struct cpumask *cpu_valid_mask = cpu_active_mask;
> +	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
>  	unsigned int dest_cpu;
>  	struct rq_flags rf;
>  	struct rq *rq;
> @@ -2366,6 +2367,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>  		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.
>  		 */
>  		cpu_valid_mask = cpu_online_mask;
> +	} else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
> +		ret = -EINVAL;
> +		goto out;
>  	}

So what about the case where the 32bit task is in-kernel and in
migrate-disable ? surely we ought to still validate the new mask against
task_cpu_possible_mask.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support
  2021-05-25 18:11       ` Marc Zyngier
@ 2021-05-26 16:00         ` Will Deacon
  0 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-26 16:00 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Greg Kroah-Hartman, Peter Zijlstra, Morten Rasmussen,
	Qais Yousef, Suren Baghdasaryan, Quentin Perret, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Rafael J. Wysocki, Dietmar Eggemann, Daniel Bristot de Oliveira,
	kernel-team

On Tue, May 25, 2021 at 07:11:44PM +0100, Marc Zyngier wrote:
> On Tue, 25 May 2021 18:27:03 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > On Tue, May 25, 2021 at 06:13:58PM +0100, Marc Zyngier wrote:
> > > On Tue, 25 May 2021 16:14:32 +0100,
> > > Will Deacon <will@kernel.org> wrote:
> > > > 
> > > > Document support for running 32-bit tasks on asymmetric 32-bit systems
> > > > and its impact on the user ABI when enabled.
> > > > 
> > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > ---
> > > >  .../admin-guide/kernel-parameters.txt         |   3 +
> > > >  Documentation/arm64/asymmetric-32bit.rst      | 154 ++++++++++++++++++
> > > >  Documentation/arm64/index.rst                 |   1 +
> > > >  3 files changed, 158 insertions(+)
> > > >  create mode 100644 Documentation/arm64/asymmetric-32bit.rst
> > > >
> > > 
> > > [...]
> > > 
> > > > +KVM
> > > > +---
> > > > +
> > > > +Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
> > > > +asymmetric system, a broken guest at EL1 could still attempt to execute
> > > > +32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
> > > > +mode will return to host userspace with an ``exit_reason`` of
> > > > +``KVM_EXIT_FAIL_ENTRY``.
> > > 
> > > Nit: there is a bit more to it. The vcpu will be left in a permanent
> > > non-runnable state until KVM_ARM_VCPU_INIT is issued to reset the vcpu
> > > into a saner state.
> > 
> > Thanks, I'll add "and will remain non-runnable until re-initialised by a
> > subsequent KVM_ARM_VCPU_INIT operation".
> 
> Looks good.

Cheers.

> > Can the VMM tell that it needs to do that? I wonder if we should be
> > setting 'hardware_entry_failure_reason' to distinguish this case.
> 
> The VMM should be able to notice that something is amiss, as any
> subsequent KVM_RUN calls will result in -ENOEXEC being returned, and
> we document this as "the vcpu hasn't been initialized or the guest
> tried to execute instructions from device memory (arm64)".
> 
> However, there is another reason to get a "FAILED_ENTRY", and that if
> we get an Illegal Exception Return exception when entering the
> guest. That one should always be a KVM bug.
> 
> So yeah, maybe there is some ground to populate that structure with
> the appropriate nastygram (completely untested).
> 
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 24223adae150..cf50051a9412 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -402,6 +402,10 @@ struct kvm_vcpu_events {
>  #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
>  #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
>  
> +/* KVM_EXIT_FAIL_ENTRY reasons */
> +#define KVM_ARM64_FAILED_ENTRY_NO_AARCH32_ALLOWED	0xBADBAD32
> +#define KVM_ARM64_FAILED_ENTRY_INTERNAL_ERROR		0xE1215BAD

Heh, you and your magic numbers ;)

I'll leave it up to you as to whether you want to populate this -- I just
spotted it and thought it might help to indicate what went wrong. This is a
pretty daft situation to end up in so whether anybody would realistically
try to recover from it is another question entirely.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-26 11:14   ` Valentin Schneider
  2021-05-26 12:32     ` Peter Zijlstra
@ 2021-05-26 16:03     ` Will Deacon
  2021-05-26 17:46       ` Valentin Schneider
  1 sibling, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-26 16:03 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 12:14:20PM +0100, Valentin Schneider wrote:
> On 25/05/21 16:14, Will Deacon wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5226cc26a095..1702a60d178d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1869,6 +1869,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
> >  struct migration_arg {
> >       struct task_struct		*task;
> >       int				dest_cpu;
> > +	const struct cpumask		*dest_mask;
> >       struct set_affinity_pending	*pending;
> >  };
> >
> > @@ -1917,6 +1918,7 @@ static int migration_cpu_stop(void *data)
> >       struct set_affinity_pending *pending = arg->pending;
> >       struct task_struct *p = arg->task;
> >       int dest_cpu = arg->dest_cpu;
> > +	const struct cpumask *dest_mask = arg->dest_mask;
> >       struct rq *rq = this_rq();
> >       bool complete = false;
> >       struct rq_flags rf;
> > @@ -1956,12 +1958,8 @@ static int migration_cpu_stop(void *data)
> >                       complete = true;
> >               }
> >
> > -		if (dest_cpu < 0) {
> > -			if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask))
> > -				goto out;
> > -
> > -			dest_cpu = cpumask_any_distribute(&p->cpus_mask);
> > -		}
> > +		if (dest_mask && (cpumask_test_cpu(task_cpu(p), dest_mask)))
> > +			goto out;
> >
> 
> IIRC the reason we deferred the pick to migration_cpu_stop() was because of
> those insane races involving multiple SCA calls the likes of:
> 
>   p->cpus_mask = [0, 1]; p on CPU0
> 
>   CPUx                           CPUy                   CPU0
> 
>   SCA(p, [2])
>     __do_set_cpus_allowed();
>     queue migration_cpu_stop()
>                                  SCA(p, [3])
>                                    __do_set_cpus_allowed();
>                                                         migration_cpu_stop()
> 
> The stopper needs to use the latest cpumask set by the second SCA despite
> having an arg->pending set up by the first SCA. Doesn't this break here?

Yes, well spotted. I was so caught up with the hotplug race that I didn't
even consider a straightforward SCA race. Hurumph.

> I'm not sure I've paged back in all of the subtleties laying in ambush
> here, but what about the below?

I can't break it, but I'm also not very familiar with this code. Please can
you post it as a proper patch so that I drop this from my series?

Thanks,

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
  2021-05-26 15:02   ` Peter Zijlstra
@ 2021-05-26 16:07     ` Will Deacon
  0 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-26 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Li Zefan

On Wed, May 26, 2021 at 05:02:20PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:18PM +0100, Will Deacon wrote:
> >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> >  {
> > +	const struct cpumask *cs_mask;
> > +	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> > +
> >  	rcu_read_lock();
> > +	cs_mask = task_cs(tsk)->cpus_allowed;
> > +
> > +	if (!is_in_v2_mode() || !cpumask_subset(cs_mask, possible_mask))
> > +		goto unlock; /* select_fallback_rq will try harder */
> > +
> > +	do_set_cpus_allowed(tsk, cs_mask);
> > +unlock:
> 
> 	if (is_in_v2_mode() && cpumask_subset(cs_mask, possible_mask))
> 		do_set_cpus_allowed(tsk, cs_mask);
> 
> perhaps?

Absolutely.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  2021-05-26 15:15   ` Peter Zijlstra
@ 2021-05-26 16:12     ` Will Deacon
  2021-05-26 17:56       ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-26 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 05:15:51PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:20PM +0100, Will Deacon wrote:
> > Reject explicit requests to change the affinity mask of a task via
> > set_cpus_allowed_ptr() if the requested mask is not a subset of the
> > mask returned by task_cpu_possible_mask(). This ensures that the
> > 'cpus_mask' for a given task cannot contain CPUs which are incapable of
> > executing it, except in cases where the affinity is forced.
> > 
> > Reviewed-by: Quentin Perret <qperret@google.com>
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  kernel/sched/core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 00ed51528c70..8ca7854747f1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2346,6 +2346,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> >  				  u32 flags)
> >  {
> >  	const struct cpumask *cpu_valid_mask = cpu_active_mask;
> > +	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> >  	unsigned int dest_cpu;
> >  	struct rq_flags rf;
> >  	struct rq *rq;
> > @@ -2366,6 +2367,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> >  		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.
> >  		 */
> >  		cpu_valid_mask = cpu_online_mask;
> > +	} else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
> > +		ret = -EINVAL;
> > +		goto out;
> >  	}
> 
> So what about the case where the 32bit task is in-kernel and in
> migrate-disable ? surely we ought to still validate the new mask against
> task_cpu_possible_mask.

That's a good question.

Given that 32-bit tasks in the kernel are running in 64-bit mode, we can
actually tolerate them moving around arbitrarily as long as they _never_ try
to return to userspace on a 64-bit-only CPU. I think this should be the case
as long as we don't try to return to userspace with migration disabled, no?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-25 15:14 ` [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems Will Deacon
@ 2021-05-26 16:20   ` Peter Zijlstra
  2021-05-26 16:35     ` Will Deacon
  2021-05-26 16:30   ` Peter Zijlstra
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 16:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Tue, May 25, 2021 at 04:14:23PM +0100, Will Deacon wrote:
> +static int restrict_cpus_allowed_ptr(struct task_struct *p,
> +				     struct cpumask *new_mask,
> +				     const struct cpumask *subset_mask)
> +{
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	int err;
> +	struct cpumask *user_mask = NULL;
> +
> +	if (!p->user_cpus_ptr) {
> +		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);

		if (!user_mask)
			return -ENOMEM;
	}

?

> +
> +	rq = task_rq_lock(p, &rf);
> +
> +	/*
> +	 * Forcefully restricting the affinity of a deadline task is
> +	 * likely to cause problems, so fail and noisily override the
> +	 * mask entirely.
> +	 */
> +	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
> +		err = -EPERM;
> +		goto err_unlock;
> +	}
> +
> +	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
> +		err = -EINVAL;
> +		goto err_unlock;
> +	}
> +
> +	/*
> +	 * We're about to butcher the task affinity, so keep track of what
> +	 * the user asked for in case we're able to restore it later on.
> +	 */
> +	if (user_mask) {
> +		cpumask_copy(user_mask, p->cpus_ptr);
> +		p->user_cpus_ptr = user_mask;
> +	}
> +
> +	return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
> +
> +err_unlock:
> +	task_rq_unlock(rq, p, &rf);
> +	kfree(user_mask);
> +	return err;
> +}

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-25 15:14 ` [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems Will Deacon
  2021-05-26 16:20   ` Peter Zijlstra
@ 2021-05-26 16:30   ` Peter Zijlstra
  2021-05-26 17:02     ` Will Deacon
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 16:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Tue, May 25, 2021 at 04:14:23PM +0100, Will Deacon wrote:
> @@ -2426,20 +2421,166 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>  
>  	__do_set_cpus_allowed(p, new_mask, flags);
>  
> -	return affine_move_task(rq, p, &rf, dest_cpu, flags);
> +	if (flags & SCA_USER)
> +		release_user_cpus_ptr(p);
> +
> +	return affine_move_task(rq, p, rf, dest_cpu, flags);
>  
>  out:
> -	task_rq_unlock(rq, p, &rf);
> +	task_rq_unlock(rq, p, rf);
>  
>  	return ret;
>  }

So sys_sched_setaffinity() releases the user_cpus_ptr thingy ?! How does
that work?

I thought the intended semantics were somethings like:

	A - 0xff			B

	restrict(0xf) // user: 0xff eff: 0xf

					sched_setaffinity(A, 0x3c) // user: 0x3c eff: 0xc

	relax() // user: NULL, eff: 0x3c



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-26 16:20   ` Peter Zijlstra
@ 2021-05-26 16:35     ` Will Deacon
  0 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-26 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 06:20:25PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:23PM +0100, Will Deacon wrote:
> > +static int restrict_cpus_allowed_ptr(struct task_struct *p,
> > +				     struct cpumask *new_mask,
> > +				     const struct cpumask *subset_mask)
> > +{
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +	int err;
> > +	struct cpumask *user_mask = NULL;
> > +
> > +	if (!p->user_cpus_ptr) {
> > +		user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
> 
> 		if (!user_mask)
> 			return -ENOMEM;
> 	}
> 
> ?

We won't blow up if we continue without user_mask here, but I agree that
it's more straightforward to return an error and have
force_compatible_cpus_allowed_ptr() noisily override the mask.

We're in pretty deep trouble if we're failing this allocation anyway.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-26 16:30   ` Peter Zijlstra
@ 2021-05-26 17:02     ` Will Deacon
  2021-05-27  7:56       ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-26 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 06:30:08PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:23PM +0100, Will Deacon wrote:
> > @@ -2426,20 +2421,166 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> >  
> >  	__do_set_cpus_allowed(p, new_mask, flags);
> >  
> > -	return affine_move_task(rq, p, &rf, dest_cpu, flags);
> > +	if (flags & SCA_USER)
> > +		release_user_cpus_ptr(p);
> > +
> > +	return affine_move_task(rq, p, rf, dest_cpu, flags);
> >  
> >  out:
> > -	task_rq_unlock(rq, p, &rf);
> > +	task_rq_unlock(rq, p, rf);
> >  
> >  	return ret;
> >  }
> 
> So sys_sched_setaffinity() releases the user_cpus_ptr thingy ?! How does
> that work?

Right, I think if the task explicitly changes its affinity then it makes
sense to forget about what it had before. It then behaves very similar to
CPU hotplug, which is the analogy I've been trying to follow: if you call
sched_setaffinity() with a mask containing offline CPUs then those CPUs
are not added back to the affinity mask when they are onlined.

> I thought the intended semantics were somethings like:
> 
> 	A - 0xff			B
> 
> 	restrict(0xf) // user: 0xff eff: 0xf
> 
> 					sched_setaffinity(A, 0x3c) // user: 0x3c eff: 0xc
> 
> 	relax() // user: NULL, eff: 0x3c

If you go down this route you can get into _really_ weird situations where
e.g. sys_sched_setaffinity() returns -EINVAL because the requested mask
contains only 64-bit-only cores, yet we've updated the user mask. It also
opens up some horrendous races between sched_setaffinity() and execve(),
since the former can transiently set an invalid mask per the cpuset
hierarchy.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination
  2021-05-26 16:03     ` Will Deacon
@ 2021-05-26 17:46       ` Valentin Schneider
  0 siblings, 0 replies; 57+ messages in thread
From: Valentin Schneider @ 2021-05-26 17:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Peter Zijlstra,
	Morten Rasmussen, Qais Yousef, Suren Baghdasaryan,
	Quentin Perret, Tejun Heo, Johannes Weiner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Rafael J. Wysocki, Dietmar Eggemann,
	Daniel Bristot de Oliveira, kernel-team

On 26/05/21 17:03, Will Deacon wrote:

> I can't break it, but I'm also not very familiar with this code. Please can
> you post it as a proper patch so that I drop this from my series?
>

I'm on it :-)

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  2021-05-26 16:12     ` Will Deacon
@ 2021-05-26 17:56       ` Peter Zijlstra
  2021-05-26 18:59         ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-26 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 05:12:49PM +0100, Will Deacon wrote:
> On Wed, May 26, 2021 at 05:15:51PM +0200, Peter Zijlstra wrote:
> > On Tue, May 25, 2021 at 04:14:20PM +0100, Will Deacon wrote:
> > > Reject explicit requests to change the affinity mask of a task via
> > > set_cpus_allowed_ptr() if the requested mask is not a subset of the
> > > mask returned by task_cpu_possible_mask(). This ensures that the
> > > 'cpus_mask' for a given task cannot contain CPUs which are incapable of
> > > executing it, except in cases where the affinity is forced.
> > > 
> > > Reviewed-by: Quentin Perret <qperret@google.com>
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  kernel/sched/core.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 00ed51528c70..8ca7854747f1 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2346,6 +2346,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > >  				  u32 flags)
> > >  {
> > >  	const struct cpumask *cpu_valid_mask = cpu_active_mask;
> > > +	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> > >  	unsigned int dest_cpu;
> > >  	struct rq_flags rf;
> > >  	struct rq *rq;
> > > @@ -2366,6 +2367,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > >  		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.
> > >  		 */
> > >  		cpu_valid_mask = cpu_online_mask;
> > > +	} else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > >  	}
> > 
> > So what about the case where the 32bit task is in-kernel and in
> > migrate-disable ? surely we ought to still validate the new mask against
> > task_cpu_possible_mask.
> 
> That's a good question.
> 
> Given that 32-bit tasks in the kernel are running in 64-bit mode, we can
> actually tolerate them moving around arbitrarily as long as they _never_ try
> to return to userspace on a 64-bit-only CPU. I think this should be the case
> as long as we don't try to return to userspace with migration disabled, no?

Consider:

	8 CPUs, lower 4 have 32bit, higher 4 do not

	A - a 32 bit task		B

	sys_foo()
	  migrate_disable()
	  				sys_sched_setaffinity(A, 0xf0)
					  if (.. | migration_disabled(A))
					    // not checking nothing

					  __do_set_cpus_allowed();

	  migrate_enable()
	    __set_cpus_allowed(SCA_MIGRATE_ENABLE)
	      // frob outselves somewhere in 0xf0
	  sysret
	  *BOOM*


That is, I'm thinking we ought to disallow that sched_setaffinity() with
-EINVAL for 0xf0 has no intersection with 0x0f.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask()
  2021-05-26 17:56       ` Peter Zijlstra
@ 2021-05-26 18:59         ` Will Deacon
  0 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-26 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 07:56:51PM +0200, Peter Zijlstra wrote:
> On Wed, May 26, 2021 at 05:12:49PM +0100, Will Deacon wrote:
> > On Wed, May 26, 2021 at 05:15:51PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 25, 2021 at 04:14:20PM +0100, Will Deacon wrote:
> > > > Reject explicit requests to change the affinity mask of a task via
> > > > set_cpus_allowed_ptr() if the requested mask is not a subset of the
> > > > mask returned by task_cpu_possible_mask(). This ensures that the
> > > > 'cpus_mask' for a given task cannot contain CPUs which are incapable of
> > > > executing it, except in cases where the affinity is forced.
> > > > 
> > > > Reviewed-by: Quentin Perret <qperret@google.com>
> > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > ---
> > > >  kernel/sched/core.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 00ed51528c70..8ca7854747f1 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2346,6 +2346,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > > >  				  u32 flags)
> > > >  {
> > > >  	const struct cpumask *cpu_valid_mask = cpu_active_mask;
> > > > +	const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
> > > >  	unsigned int dest_cpu;
> > > >  	struct rq_flags rf;
> > > >  	struct rq *rq;
> > > > @@ -2366,6 +2367,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > > >  		 * set_cpus_allowed_common() and actually reset p->cpus_ptr.
> > > >  		 */
> > > >  		cpu_valid_mask = cpu_online_mask;
> > > > +	} else if (!cpumask_subset(new_mask, cpu_allowed_mask)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out;
> > > >  	}
> > > 
> > > So what about the case where the 32bit task is in-kernel and in
> > > migrate-disable ? surely we ought to still validate the new mask against
> > > task_cpu_possible_mask.
> > 
> > That's a good question.
> > 
> > Given that 32-bit tasks in the kernel are running in 64-bit mode, we can
> > actually tolerate them moving around arbitrarily as long as they _never_ try
> > to return to userspace on a 64-bit-only CPU. I think this should be the case
> > as long as we don't try to return to userspace with migration disabled, no?
> 
> Consider:
> 
> 	8 CPUs, lower 4 have 32bit, higher 4 do not
> 
> 	A - a 32 bit task		B
> 
> 	sys_foo()
> 	  migrate_disable()
> 	  				sys_sched_setaffinity(A, 0xf0)
> 					  if (.. | migration_disabled(A))
> 					    // not checking nothing
> 
> 					  __do_set_cpus_allowed();
> 
> 	  migrate_enable()
> 	    __set_cpus_allowed(SCA_MIGRATE_ENABLE)
> 	      // frob outselves somewhere in 0xf0
> 	  sysret
> 	  *BOOM*
> 
> 
> That is, I'm thinking we ought to disallow that sched_setaffinity() with
> -EINVAL for 0xf0 has no intersection with 0x0f.

I *think* the cpuset_cpus_allowed() check in sys_sched_setaffinity()
will save us here by reducing the 0xf0 mask to 0x0 early on. However,
after seeing this example I'd be much happier checking this in SCA anyway so
I'll add that for the next version!

Thanks,

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems
  2021-05-26 17:02     ` Will Deacon
@ 2021-05-27  7:56       ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-27  7:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Wed, May 26, 2021 at 06:02:06PM +0100, Will Deacon wrote:
> On Wed, May 26, 2021 at 06:30:08PM +0200, Peter Zijlstra wrote:
> > On Tue, May 25, 2021 at 04:14:23PM +0100, Will Deacon wrote:
> > > @@ -2426,20 +2421,166 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > >  
> > >  	__do_set_cpus_allowed(p, new_mask, flags);
> > >  
> > > -	return affine_move_task(rq, p, &rf, dest_cpu, flags);
> > > +	if (flags & SCA_USER)
> > > +		release_user_cpus_ptr(p);
> > > +
> > > +	return affine_move_task(rq, p, rf, dest_cpu, flags);
> > >  
> > >  out:
> > > -	task_rq_unlock(rq, p, &rf);
> > > +	task_rq_unlock(rq, p, rf);
> > >  
> > >  	return ret;
> > >  }
> > 
> > So sys_sched_setaffinity() releases the user_cpus_ptr thingy ?! How does
> > that work?
> 
> Right, I think if the task explicitly changes its affinity then it makes
> sense to forget about what it had before. It then behaves very similar to
> CPU hotplug, which is the analogy I've been trying to follow: if you call
> sched_setaffinity() with a mask containing offline CPUs then those CPUs
> are not added back to the affinity mask when they are onlined.

Oh right, crap semantics all the way down :/ I always forget how
horrible they are.

You're right though; this is consistent with the current mess.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-25 15:14 ` [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks Will Deacon
@ 2021-05-27 14:10   ` Peter Zijlstra
  2021-05-27 14:31     ` Peter Zijlstra
  2021-05-27 14:36     ` Will Deacon
  2021-06-01  8:21   ` [RFC][PATCH] freezer,sched: Rewrite core freezer logic Peter Zijlstra
  1 sibling, 2 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-27 14:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Tue, May 25, 2021 at 04:14:26PM +0100, Will Deacon wrote:
> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index dc520f01f99d..8f3d950c2a87 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -11,6 +11,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/freezer.h>
>  #include <linux/kthread.h>
> +#include <linux/mmu_context.h>
>  
>  /* total number of freezing conditions in effect */
>  atomic_t system_freezing_cnt = ATOMIC_INIT(0);
> @@ -146,9 +147,16 @@ bool freeze_task(struct task_struct *p)
>  void __thaw_task(struct task_struct *p)
>  {
>  	unsigned long flags;
> +	const struct cpumask *mask = task_cpu_possible_mask(p);
>  
>  	spin_lock_irqsave(&freezer_lock, flags);
> -	if (frozen(p))
> +	/*
> +	 * Wake up frozen tasks. On asymmetric systems where tasks cannot
> +	 * run on all CPUs, ttwu() may have deferred a wakeup generated
> +	 * before thaw_secondary_cpus() had completed so we generate
> +	 * additional wakeups here for tasks in the PF_FREEZER_SKIP state.
> +	 */
> +	if (frozen(p) || (frozen_or_skipped(p) && mask != cpu_possible_mask))
>  		wake_up_process(p);
>  	spin_unlock_irqrestore(&freezer_lock, flags);
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 42e2aecf087c..6cb9677d635a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3529,6 +3529,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	if (!(p->state & state))
>  		goto unlock;
>  
> +#ifdef CONFIG_FREEZER
> +	/*
> +	 * If we're going to wake up a thread which may be frozen, then
> +	 * we can only do so if we have an active CPU which is capable of
> +	 * running it. This may not be the case when resuming from suspend,
> +	 * as the secondary CPUs may not yet be back online. See __thaw_task()
> +	 * for the actual wakeup.
> +	 */
> +	if (unlikely(frozen_or_skipped(p)) &&
> +	    !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
> +		goto unlock;
> +#endif
> +
>  	trace_sched_waking(p);
>  
>  	/* We're going to change ->state: */

OK, I really hate this. This is slowing down the very hot wakeup path
for the silly freezer that *never* happens. Let me try and figure out if
there's another option.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:10   ` Peter Zijlstra
@ 2021-05-27 14:31     ` Peter Zijlstra
  2021-05-27 14:44       ` Will Deacon
                         ` (2 more replies)
  2021-05-27 14:36     ` Will Deacon
  1 sibling, 3 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-27 14:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 04:10:16PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:26PM +0100, Will Deacon wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 42e2aecf087c..6cb9677d635a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3529,6 +3529,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	if (!(p->state & state))
> >  		goto unlock;
> >  
> > +#ifdef CONFIG_FREEZER
> > +	/*
> > +	 * If we're going to wake up a thread which may be frozen, then
> > +	 * we can only do so if we have an active CPU which is capable of
> > +	 * running it. This may not be the case when resuming from suspend,
> > +	 * as the secondary CPUs may not yet be back online. See __thaw_task()
> > +	 * for the actual wakeup.
> > +	 */
> > +	if (unlikely(frozen_or_skipped(p)) &&
> > +	    !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
> > +		goto unlock;
> > +#endif
> > +
> >  	trace_sched_waking(p);
> >  
> >  	/* We're going to change ->state: */
> 
> OK, I really hate this. This is slowing down the very hot wakeup path
> for the silly freezer that *never* happens. Let me try and figure out if
> there's another option.


How's something *completely* untested like this?

---
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..44ece41c3db3 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -24,7 +24,7 @@ extern unsigned int freeze_timeout_msecs;
  */
 static inline bool frozen(struct task_struct *p)
 {
-	return p->flags & PF_FROZEN;
+	return p->state == TASK_FROZEN;
 }
 
 extern bool freezing_slow_path(struct task_struct *p);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2982cfab1ae9..7e7775c5b742 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,7 +95,8 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+#define TASK_FROZEN			0x1000
+#define TASK_STATE_MAX			0x2000
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1579,7 +1580,6 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..be6c86078510 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -63,18 +63,13 @@ bool __refrigerator(bool check_kthr_stop)
 	pr_debug("%s entered refrigerator\n", current->comm);
 
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		if (freezing(current) && (check_kthr_stop && kthread_should_stop())) {
+			set_current_state(TASK_FROZEN);
+			was_frozen = true;
+		}
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
-			break;
-		was_frozen = true;
 		schedule();
 	}
 
@@ -149,7 +144,7 @@ void __thaw_task(struct task_struct *p)
 
 	spin_lock_irqsave(&freezer_lock, flags);
 	if (frozen(p))
-		wake_up_process(p);
+		wake_up_state(p, TASK_FROZEN);
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..71a6509f8d4f 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely((t->flags & PF_FREEZER_SKIP) || frozen(t)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 401f012349d1..1eedf6a044f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5889,7 +5889,7 @@ static void __sched notrace __schedule(bool preempt)
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:10   ` Peter Zijlstra
  2021-05-27 14:31     ` Peter Zijlstra
@ 2021-05-27 14:36     ` Will Deacon
  1 sibling, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-05-27 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 04:10:16PM +0200, Peter Zijlstra wrote:
> On Tue, May 25, 2021 at 04:14:26PM +0100, Will Deacon wrote:
> > diff --git a/kernel/freezer.c b/kernel/freezer.c
> > index dc520f01f99d..8f3d950c2a87 100644
> > --- a/kernel/freezer.c
> > +++ b/kernel/freezer.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/freezer.h>
> >  #include <linux/kthread.h>
> > +#include <linux/mmu_context.h>
> >  
> >  /* total number of freezing conditions in effect */
> >  atomic_t system_freezing_cnt = ATOMIC_INIT(0);
> > @@ -146,9 +147,16 @@ bool freeze_task(struct task_struct *p)
> >  void __thaw_task(struct task_struct *p)
> >  {
> >  	unsigned long flags;
> > +	const struct cpumask *mask = task_cpu_possible_mask(p);
> >  
> >  	spin_lock_irqsave(&freezer_lock, flags);
> > -	if (frozen(p))
> > +	/*
> > +	 * Wake up frozen tasks. On asymmetric systems where tasks cannot
> > +	 * run on all CPUs, ttwu() may have deferred a wakeup generated
> > +	 * before thaw_secondary_cpus() had completed so we generate
> > +	 * additional wakeups here for tasks in the PF_FREEZER_SKIP state.
> > +	 */
> > +	if (frozen(p) || (frozen_or_skipped(p) && mask != cpu_possible_mask))
> >  		wake_up_process(p);
> >  	spin_unlock_irqrestore(&freezer_lock, flags);
> >  }
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 42e2aecf087c..6cb9677d635a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3529,6 +3529,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >  	if (!(p->state & state))
> >  		goto unlock;
> >  
> > +#ifdef CONFIG_FREEZER
> > +	/*
> > +	 * If we're going to wake up a thread which may be frozen, then
> > +	 * we can only do so if we have an active CPU which is capable of
> > +	 * running it. This may not be the case when resuming from suspend,
> > +	 * as the secondary CPUs may not yet be back online. See __thaw_task()
> > +	 * for the actual wakeup.
> > +	 */
> > +	if (unlikely(frozen_or_skipped(p)) &&
> > +	    !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
> > +		goto unlock;
> > +#endif
> > +
> >  	trace_sched_waking(p);
> >  
> >  	/* We're going to change ->state: */
> 
> OK, I really hate this. This is slowing down the very hot wakeup path
> for the silly freezer that *never* happens. Let me try and figure out if
> there's another option.

I'm definitely open to alternative suggestions here. An easy thing to do
would be move the 'flags' field up in task_struct so that the previous
access to state pulls it in for us.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:31     ` Peter Zijlstra
@ 2021-05-27 14:44       ` Will Deacon
  2021-05-27 14:55         ` Peter Zijlstra
  2021-05-27 14:50       ` Peter Zijlstra
  2021-05-28 10:49       ` Peter Zijlstra
  2 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-05-27 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 04:31:51PM +0200, Peter Zijlstra wrote:
> On Thu, May 27, 2021 at 04:10:16PM +0200, Peter Zijlstra wrote:
> > On Tue, May 25, 2021 at 04:14:26PM +0100, Will Deacon wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 42e2aecf087c..6cb9677d635a 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3529,6 +3529,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > >  	if (!(p->state & state))
> > >  		goto unlock;
> > >  
> > > +#ifdef CONFIG_FREEZER
> > > +	/*
> > > +	 * If we're going to wake up a thread which may be frozen, then
> > > +	 * we can only do so if we have an active CPU which is capable of
> > > +	 * running it. This may not be the case when resuming from suspend,
> > > +	 * as the secondary CPUs may not yet be back online. See __thaw_task()
> > > +	 * for the actual wakeup.
> > > +	 */
> > > +	if (unlikely(frozen_or_skipped(p)) &&
> > > +	    !cpumask_intersects(cpu_active_mask, task_cpu_possible_mask(p)))
> > > +		goto unlock;
> > > +#endif
> > > +
> > >  	trace_sched_waking(p);
> > >  
> > >  	/* We're going to change ->state: */
> > 
> > OK, I really hate this. This is slowing down the very hot wakeup path
> > for the silly freezer that *never* happens. Let me try and figure out if
> > there's another option.
> 
> 
> How's something *completely* untested like this?

I'm not seeing how this handles tasks which weren't put in the freezer
because they have PF_FREEZER_SKIP set. For these tasks, we need to make
sure that they don't become runnable before we have onlined a core which
is capable of running them, and this could occur because of any old
wakeup (i.e. whatever it was that they blocked on).

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:31     ` Peter Zijlstra
  2021-05-27 14:44       ` Will Deacon
@ 2021-05-27 14:50       ` Peter Zijlstra
  2021-05-28 10:49       ` Peter Zijlstra
  2 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-27 14:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 04:31:51PM +0200, Peter Zijlstra wrote:
> @@ -149,7 +144,7 @@ void __thaw_task(struct task_struct *p)
>  
>  	spin_lock_irqsave(&freezer_lock, flags);
>  	if (frozen(p))
> -		wake_up_process(p);
> +		wake_up_state(p, TASK_FROZEN);

Possibly, that wants | TASK_NORMAL added.

>  	spin_unlock_irqrestore(&freezer_lock, flags);
>  }

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:44       ` Will Deacon
@ 2021-05-27 14:55         ` Peter Zijlstra
  0 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-27 14:55 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 03:44:12PM +0100, Will Deacon wrote:
> I'm not seeing how this handles tasks which weren't put in the freezer
> because they have PF_FREEZER_SKIP set. For these tasks, we need to make
> sure that they don't become runnable before we have onlined a core which
> is capable of running them, and this could occur because of any old
> wakeup (i.e. whatever it was that they blocked on).

I was under the impression that userspace tasks could not have
PF_FREEZER_SKIP set.. let me look harder.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks
  2021-05-27 14:31     ` Peter Zijlstra
  2021-05-27 14:44       ` Will Deacon
  2021-05-27 14:50       ` Peter Zijlstra
@ 2021-05-28 10:49       ` Peter Zijlstra
  2 siblings, 0 replies; 57+ messages in thread
From: Peter Zijlstra @ 2021-05-28 10:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Thu, May 27, 2021 at 04:31:51PM +0200, Peter Zijlstra wrote:
> How's something *completely* untested like this?

I now have the below; which builds defconfig -cgroup-freezer. I've yet
to boot and test.

---
 include/linux/freezer.h | 53 +++++++++++---------------------------
 include/linux/sched.h   |  6 ++---
 init/do_mounts_initrd.c |  7 ++----
 kernel/freezer.c        | 67 ++++++++++++++++++++++++++++++-------------------
 kernel/hung_task.c      |  4 +--
 kernel/power/main.c     |  5 ++--
 kernel/power/process.c  | 10 +++-----
 kernel/sched/core.c     |  2 +-
 8 files changed, 70 insertions(+), 84 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..cdc65bc1e081 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -24,7 +26,7 @@ extern unsigned int freeze_timeout_msecs;
  */
 static inline bool frozen(struct task_struct *p)
 {
-	return p->flags & PF_FROZEN;
+	return p->state == TASK_FROZEN;
 }
 
 extern bool freezing_slow_path(struct task_struct *p);
@@ -34,9 +36,10 @@ extern bool freezing_slow_path(struct task_struct *p);
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -92,6 +95,8 @@ static inline bool cgroup_freezing(struct task_struct *task)
  * waking up the parent.
  */
 
+extern void __freezer_do_not_count(void);
+extern void __freezer_count(void);
 
 /**
  * freezer_do_not_count - tell freezer to ignore %current
@@ -106,7 +111,8 @@ static inline bool cgroup_freezing(struct task_struct *task)
  */
 static inline void freezer_do_not_count(void)
 {
-	current->flags |= PF_FREEZER_SKIP;
+	if (static_branch_unlikely(&freezer_active))
+		__freezer_do_not_count();
 }
 
 /**
@@ -118,47 +124,17 @@ static inline void freezer_do_not_count(void)
  */
 static inline void freezer_count(void)
 {
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
+	if (static_branch_unlikely(&freezer_active))
+		__freezer_count();
 }
 
 /* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
 static inline void freezer_count_unsafe(void)
 {
-	current->flags &= ~PF_FREEZER_SKIP;
 	smp_mb();
 	try_to_freeze_unsafe();
 }
 
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
 /*
  * These functions are intended to be used whenever you want allow a sleeping
  * task to be frozen. Note that neither return any clear indication of
@@ -283,7 +259,6 @@ static inline bool try_to_freeze(void) { return false; }
 
 static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
 #define freezable_schedule()  schedule()
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c881384517..744dc617dbea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,7 +95,9 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+#define TASK_MAY_FREEZE			0x1000
+#define TASK_FROZEN			0x2000
+#define TASK_STATE_MAX			0x4000
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1572,7 +1574,6 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1584,7 +1585,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 533d81ed74d4..2f1227053694 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -80,10 +80,9 @@ static void __init handle_initrd(void)
 	init_chdir("/old");
 
 	/*
-	 * In case that a resume from disk is carried out by linuxrc or one of
-	 * its children, we need to tell the freezer not to wait for us.
+	 * PF_FREEZER_SKIP is pointless because UMH is laundered through a
+	 * workqueue and that doesn't have the current context.
 	 */
-	current->flags |= PF_FREEZER_SKIP;
 
 	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
 					 GFP_KERNEL, init_linuxrc, NULL, NULL);
@@ -91,8 +90,6 @@ static void __init handle_initrd(void)
 		return;
 	call_usermodehelper_exec(info, UMH_WAIT_PROC);
 
-	current->flags &= ~PF_FREEZER_SKIP;
-
 	/* move initrd to rootfs' /old */
 	init_mount("..", ".", NULL, MS_MOVE, NULL);
 	/* switch root and cwd back to / of rootfs */
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..e316008042df 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,8 +13,8 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
 /* indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
@@ -29,7 +29,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -63,18 +63,13 @@ bool __refrigerator(bool check_kthr_stop)
 	pr_debug("%s entered refrigerator\n", current->comm);
 
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		if (freezing(current) && (check_kthr_stop && kthread_should_stop())) {
+			set_current_state(TASK_FROZEN);
+			was_frozen = true;
+		}
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
-			break;
-		was_frozen = true;
 		schedule();
 	}
 
@@ -101,6 +96,38 @@ static void fake_signal_wake_up(struct task_struct *p)
 	}
 }
 
+void __freezer_do_not_count(void)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	if (current->state)
+		current->state |= TASK_MAY_FREEZE;
+	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+}
+EXPORT_SYMBOL(__freezer_do_not_count);
+
+void __freezer_count(void)
+{
+	try_to_freeze();
+}
+EXPORT_SYMBOL(__freezer_count);
+
+static bool __freeze_task(struct task_struct *p)
+{
+	unsigned long flags;
+	bool frozen = false;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	if (p->state & TASK_MAY_FREEZE) {
+		p->state = TASK_FROZEN;
+		frozen = true;
+	}
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return frozen;
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +143,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -149,7 +164,7 @@ void __thaw_task(struct task_struct *p)
 
 	spin_lock_irqsave(&freezer_lock, flags);
 	if (frozen(p))
-		wake_up_process(p);
+		wake_up_state(p, TASK_FROZEN);
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..154cac8b805a 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(t->state & (TASK_MAY_FREEZE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 12c7e1bb442f..0817e3913294 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -23,7 +23,8 @@
 
 void lock_system_sleep(void)
 {
-	current->flags |= PF_FREEZER_SKIP;
+	WARN_ON_ONCE(current->flags & PF_NOFREEZE);
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(lock_system_sleep);
@@ -46,7 +47,7 @@ void unlock_system_sleep(void)
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	current->flags &= ~PF_FREEZER_SKIP;
+	current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 50cc63534486..36dee2dcfeab 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user_only)
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user_only)
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -132,7 +130,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(true);
 	pr_info("Freezing user space processes ... ");
@@ -193,7 +191,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..6bc5df6f8d73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5082,7 +5082,7 @@ static void __sched notrace __schedule(bool preempt)
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-05-25 15:14 ` [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks Will Deacon
  2021-05-27 14:10   ` Peter Zijlstra
@ 2021-06-01  8:21   ` Peter Zijlstra
  2021-06-01 11:27     ` Peter Zijlstra
  1 sibling, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-06-01  8:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team


Hi,

This here rewrites the core freezer to behave better wrt thawing. By
replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until woken and don't randomly wake up
early, as is currently possible.

As such, it does away with PF_FROZEN and PF_FREEZER_SKIP (yay).

It does however completely wreck kernel/cgroup/legacy_freezer.c and I've
not yet spend any time on trying to figure out that code, will do so
shortly.

Other than that, the freezer seems to work fine, I've tested it with:

  echo freezer > /sys/power/pm_test
  echo mem > /sys/power/state

Even while having a GDB session active, and that all works.

Another notable bit is in init/do_mounts_initrd.c; afaict that has been
'broken' for quite a while and is simply removed.

Please have a look.

Somewhat-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 drivers/android/binder.c     |   4 +-
 drivers/media/pci/pt3/pt3.c  |   4 +-
 fs/cifs/inode.c              |   4 +-
 fs/cifs/transport.c          |   5 +-
 fs/coredump.c                |   4 +-
 fs/nfs/file.c                |   3 +-
 fs/nfs/inode.c               |  12 +--
 fs/nfs/nfs4proc.c            |  14 +--
 fs/nfs/nfs4state.c           |   3 +-
 fs/nfs/pnfs.c                |   4 +-
 fs/xfs/xfs_trans_ail.c       |   8 +-
 include/linux/completion.h   |   2 +
 include/linux/freezer.h      | 244 ++-----------------------------------------
 include/linux/sched.h        |   9 +-
 include/linux/sunrpc/sched.h |   7 +-
 include/linux/wait.h         |  39 +++++--
 init/do_mounts_initrd.c      |   7 +-
 kernel/exit.c                |   4 +-
 kernel/fork.c                |   4 +-
 kernel/freezer.c             | 102 ++++++++++++------
 kernel/futex.c               |   4 +-
 kernel/hung_task.c           |   4 +-
 kernel/power/main.c          |   5 +-
 kernel/power/process.c       |  10 +-
 kernel/sched/completion.c    |  16 +++
 kernel/sched/core.c          |   2 +-
 kernel/signal.c              |  14 +--
 kernel/time/hrtimer.c        |   4 +-
 mm/khugepaged.c              |   4 +-
 net/sunrpc/sched.c           |  12 +--
 net/unix/af_unix.c           |   8 +-
 31 files changed, 202 insertions(+), 364 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index bcec598b89f2..70b1dbf2c2b1 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -3712,10 +3712,9 @@ static int binder_wait_for_work(struct binder_thread *thread,
 	struct binder_proc *proc = thread->proc;
 	int ret = 0;
 
-	freezer_do_not_count();
 	binder_inner_proc_lock(proc);
 	for (;;) {
-		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		if (binder_has_work_ilocked(thread, do_proc_work))
 			break;
 		if (do_proc_work)
@@ -3732,7 +3731,6 @@ static int binder_wait_for_work(struct binder_thread *thread,
 	}
 	finish_wait(&thread->wait, &wait);
 	binder_inner_proc_unlock(proc);
-	freezer_count();
 
 	return ret;
 }
diff --git a/drivers/media/pci/pt3/pt3.c b/drivers/media/pci/pt3/pt3.c
index c0bc86793355..cc3cfbc0e33c 100644
--- a/drivers/media/pci/pt3/pt3.c
+++ b/drivers/media/pci/pt3/pt3.c
@@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
 		pt3_proc_dma(adap);
 
 		delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		freezable_schedule_hrtimeout_range(&delay,
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_hrtimeout_range(&delay,
 					PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
 					HRTIMER_MODE_REL);
 	}
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 1dfa57982522..8fccf35392ec 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2280,7 +2280,7 @@ cifs_invalidate_mapping(struct inode *inode)
 static int
 cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -2297,7 +2297,7 @@ cifs_revalidate_mapping(struct inode *inode)
 		return 0;
 
 	rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
-				     TASK_KILLABLE);
+				     TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (rc)
 		return rc;
 
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index c1725b55f364..780aea1002e2 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -771,8 +771,9 @@ wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ)
 {
 	int error;
 
-	error = wait_event_freezekillable_unsafe(server->response_q,
-				    midQ->mid_state != MID_REQUEST_SUBMITTED);
+	error = wait_event_state(server->response_q,
+				 midQ->mid_state != MID_REQUEST_SUBMITTED,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (error < 0)
 		return -ERESTARTSYS;
 
diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..50d5db61e6e0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -465,9 +465,7 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
-		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
-		freezer_count();
+		wait_for_completion_freezable(&core_state->startup);
 		/*
 		 * Wait for all the threads to become inactive, so that
 		 * all the thread context (extended register state, like
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1fef107961bc..41e4bb19023d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -558,7 +558,8 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)
 	nfs_fscache_wait_on_page_write(NFS_I(inode), page);
 
 	wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
+			   nfs_wait_bit_killable,
+			   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 
 	lock_page(page);
 	mapping = page_file_mapping(page);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 529c4099f482..df97984e9e5d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
 	return nfs_fileid_to_ino_t(fattr->fileid);
 }
 
-static int nfs_wait_killable(int mode)
+int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
 }
-
-int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
-{
-	return nfs_wait_killable(mode);
-}
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
 /**
@@ -1323,7 +1318,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
 	 */
 	for (;;) {
 		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
-					 nfs_wait_bit_killable, TASK_KILLABLE);
+					 nfs_wait_bit_killable,
+					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (ret)
 			goto out;
 		spin_lock(&inode->i_lock);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 87d04f2c9385..925c8bc217b5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -411,8 +411,8 @@ static int nfs4_delay_killable(long *timeout)
 {
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
+	__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!__fatal_signal_pending(current))
 		return 0;
 	return -EINTR;
@@ -422,7 +422,8 @@ static int nfs4_delay_interruptible(long *timeout)
 {
 	might_sleep();
 
-	freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!signal_pending(current))
 		return 0;
 	return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
@@ -7280,7 +7281,8 @@ nfs4_retry_setlk_simple(struct nfs4_state *state, int cmd,
 		status = nfs4_proc_setlk(state, cmd, request);
 		if ((status != -EAGAIN) || IS_SETLK(cmd))
 			break;
-		freezable_schedule_timeout_interruptible(timeout);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_timeout(timeout);
 		timeout *= 2;
 		timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
 		status = -ERESTARTSYS;
@@ -7348,10 +7350,8 @@ nfs4_retry_setlk(struct nfs4_state *state, int cmd, struct file_lock *request)
 			break;
 
 		status = -ERESTARTSYS;
-		freezer_do_not_count();
-		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
+		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 			   NFS4_LOCK_MAXTIMEOUT);
-		freezer_count();
 	} while (!signalled());
 
 	remove_wait_queue(q, &waiter.wait);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f22818a80c2c..37f1aea98f8f 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1307,7 +1307,8 @@ int nfs4_wait_clnt_recover(struct nfs_client *clp)
 
 	refcount_inc(&clp->cl_count);
 	res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
-				 nfs_wait_bit_killable, TASK_KILLABLE);
+				 nfs_wait_bit_killable,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (res)
 		goto out;
 	if (clp->cl_cons_state < 0)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 03e0b34c4a64..807d8d56a09a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1898,7 +1898,7 @@ static int pnfs_prepare_to_retry_layoutget(struct pnfs_layout_hdr *lo)
 	pnfs_layoutcommit_inode(lo->plh_inode, false);
 	return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
 				   nfs_wait_bit_killable,
-				   TASK_KILLABLE);
+				   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
 
 static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
@@ -3173,7 +3173,7 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 		status = wait_on_bit_lock_action(&nfsi->flags,
 				NFS_INO_LAYOUTCOMMITTING,
 				nfs_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (status)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..bbf135df69c5 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -585,9 +585,9 @@ xfsaild(
 
 	while (1) {
 		if (tout && tout <= 20)
-			set_current_state(TASK_KILLABLE);
+			set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
 		else
-			set_current_state(TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 		/*
 		 * Check kthread_should_stop() after we set the task state to
@@ -636,14 +636,14 @@ xfsaild(
 		    ailp->ail_target == ailp->ail_target_prev &&
 		    list_empty(&ailp->ail_buf_list)) {
 			spin_unlock(&ailp->ail_lock);
-			freezable_schedule();
+			schedule();
 			tout = 0;
 			continue;
 		}
 		spin_unlock(&ailp->ail_lock);
 
 		if (tout)
-			freezable_schedule_timeout(msecs_to_jiffies(tout));
+			schedule_timeout(msecs_to_jiffies(tout));
 
 		__set_current_state(TASK_RUNNING);
 
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 51d9ab079629..76f8b5dac648 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -100,9 +100,11 @@ static inline void reinit_completion(struct completion *x)
 }
 
 extern void wait_for_completion(struct completion *);
+extern void wait_for_completion_freezable(struct completion *);
 extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
+extern int wait_for_completion_killable_freezable(struct completion *x);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
 						   unsigned long timeout);
 extern unsigned long wait_for_completion_io_timeout(struct completion *x,
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..c1cece97fd82 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs;
 /*
  * Check if a process has been frozen
  */
-static inline bool frozen(struct task_struct *p)
-{
-	return p->flags & PF_FROZEN;
-}
+extern bool frozen(struct task_struct *p);
 
 extern bool freezing_slow_path(struct task_struct *p);
 
@@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct task_struct *p);
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
 extern void thaw_kernel_threads(void);
 
-/*
- * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
- * If try_to_freeze causes a lockdep warning it means the caller may deadlock
- */
-static inline bool try_to_freeze_unsafe(void)
+static inline bool try_to_freeze(void)
 {
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
-	return __refrigerator(false);
-}
-
-static inline bool try_to_freeze(void)
-{
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
-	return try_to_freeze_unsafe();
+	return __refrigerator(false);
 }
 
 extern bool freeze_task(struct task_struct *p);
@@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struct task_struct *task)
 }
 #endif /* !CONFIG_CGROUP_FREEZER */
 
-/*
- * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
- * calls wait_for_completion(&vfork) and reset right after it returns from this
- * function.  Next, the parent should call try_to_freeze() to freeze itself
- * appropriately in case the child has exited before the freezing of tasks is
- * complete.  However, we don't want kernel threads to be frozen in unexpected
- * places, so we allow them to block freeze_processes() instead or to set
- * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
- * parent won't really block freeze_processes(), since ____call_usermodehelper()
- * (the child) does a little before exec/exit and it can't be frozen before
- * waking up the parent.
- */
-
-
-/**
- * freezer_do_not_count - tell freezer to ignore %current
- *
- * Tell freezers to ignore the current task when determining whether the
- * target frozen state is reached.  IOW, the current task will be
- * considered frozen enough by freezers.
- *
- * The caller shouldn't do anything which isn't allowed for a frozen task
- * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
- * wrap a scheduling operation and nothing much else.
- */
-static inline void freezer_do_not_count(void)
-{
-	current->flags |= PF_FREEZER_SKIP;
-}
-
-/**
- * freezer_count - tell freezer to stop ignoring %current
- *
- * Undo freezer_do_not_count().  It tells freezers that %current should be
- * considered again and tries to freeze if freezing condition is already in
- * effect.
- */
-static inline void freezer_count(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezer_count_unsafe(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	smp_mb();
-	try_to_freeze_unsafe();
-}
-
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
-/*
- * These functions are intended to be used whenever you want allow a sleeping
- * task to be frozen. Note that neither return any clear indication of
- * whether a freeze event happened while in this function.
- */
-
-/* Like schedule(), but should not block the freezer. */
-static inline void freezable_schedule(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezable_schedule_unsafe(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count_unsafe();
-}
-
-/*
- * Like schedule_timeout(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout_interruptible(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
-{
-	long __retval;
-
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/* Like schedule_timeout_killable(), but should not block the freezer. */
-static inline long freezable_schedule_timeout_killable(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/*
- * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
-		u64 delta, const enum hrtimer_mode mode)
-{
-	int __retval;
-	freezer_do_not_count();
-	__retval = schedule_hrtimeout_range(expires, delta, mode);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Freezer-friendly wrappers around wait_event_interruptible(),
- * wait_event_killable() and wait_event_interruptible_timeout(), originally
- * defined in <linux/wait.h>
- */
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-({									\
-	int __retval;							\
-	freezer_do_not_count();						\
-	__retval = wait_event_killable(wq, (condition));		\
-	freezer_count_unsafe();						\
-	__retval;							\
-})
-
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
@@ -281,35 +83,9 @@ static inline void thaw_kernel_threads(void) {}
 
 static inline bool try_to_freeze(void) { return false; }
 
-static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
-#define freezable_schedule()  schedule()
-
-#define freezable_schedule_unsafe()  schedule()
-
-#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
-
-#define freezable_schedule_timeout_interruptible(timeout)		\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_interruptible_unsafe(timeout)	\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_killable(timeout)			\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_timeout_killable_unsafe(timeout)		\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_hrtimeout_range(expires, delta, mode)	\
-	schedule_hrtimeout_range(expires, delta, mode)
-
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-		wait_event_killable(wq, condition)
-
 #endif /* !CONFIG_FREEZER */
 
 #endif	/* FREEZER_H_INCLUDED */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2982cfab1ae9..bfadc1dbcf24 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,7 +95,12 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+#define TASK_FREEZABLE			0x1000
+#define __TASK_FREEZABLE_UNSAFE		0x2000
+#define TASK_FROZEN			0x4000
+#define TASK_STATE_MAX			0x8000
+
+#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1579,7 +1584,6 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1591,7 +1595,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index df696efdd675..4ecc7939997b 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -263,7 +263,7 @@ int		rpc_malloc(struct rpc_task *);
 void		rpc_free(struct rpc_task *);
 int		rpciod_up(void);
 void		rpciod_down(void);
-int		__rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int		rpc_wait_for_completion_task(struct rpc_task *task);
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 struct net;
 void		rpc_show_tasks(struct net *);
@@ -274,11 +274,6 @@ extern struct workqueue_struct *rpciod_workqueue;
 extern struct workqueue_struct *xprtiod_workqueue;
 void		rpc_prepare_task(struct rpc_task *task);
 
-static inline int rpc_wait_for_completion_task(struct rpc_task *task)
-{
-	return __rpc_wait_for_completion_task(task, NULL);
-}
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
 static inline const char * rpc_qname(const struct rpc_wait_queue *q)
 {
diff --git a/include/linux/wait.h b/include/linux/wait.h
index fe10e8570a52..88fe2c50637a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -335,8 +335,8 @@ do {										\
 } while (0)
 
 #define __wait_event_freezable(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-			    freezable_schedule())
+	___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),	\
+			0, 0, schedule())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
@@ -394,8 +394,8 @@ do {										\
 
 #define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
-		      TASK_INTERRUPTIBLE, 0, timeout,				\
-		      __ret = freezable_schedule_timeout(__ret))
+		      (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,		\
+		      __ret = schedule_timeout(__ret))
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
@@ -615,8 +615,8 @@ do {										\
 
 
 #define __wait_event_freezable_exclusive(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,			\
-			freezable_schedule())
+	___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
+			schedule())
 
 #define wait_event_freezable_exclusive(wq, condition)				\
 ({										\
@@ -905,6 +905,33 @@ extern int do_wait_intr_irq(wait_queue_head_t *, wait_queue_entry_t *);
 	__ret;									\
 })
 
+#define __wait_event_state(wq, condition, state)				\
+	___wait_event(wq, condition, state, 0, 0, schedule())
+
+/**
+ * wait_event_state - sleep until a condition gets true
+ * @wq_head: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ *
+ * The process is put to sleep (TASK_KILLABLE) until the
+ * @condition evaluates to true or a signal is received.
+ * The @condition is checked each time the waitqueue @wq_head is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function will return -ERESTARTSYS if it was interrupted by a
+ * signal and 0 if @condition evaluated to true.
+ */
+#define wait_event_state(wq_head, condition, state)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_killable_freezable(wq_head, condition, state); \
+	__ret;									\
+})
+
 #define __wait_event_killable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
 		      TASK_KILLABLE, 0, timeout,				\
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 533d81ed74d4..2f1227053694 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -80,10 +80,9 @@ static void __init handle_initrd(void)
 	init_chdir("/old");
 
 	/*
-	 * In case that a resume from disk is carried out by linuxrc or one of
-	 * its children, we need to tell the freezer not to wait for us.
+	 * PF_FREEZER_SKIP is pointless because UMH is laundered through a
+	 * workqueue and that doesn't have the current context.
 	 */
-	current->flags |= PF_FREEZER_SKIP;
 
 	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
 					 GFP_KERNEL, init_linuxrc, NULL, NULL);
@@ -91,8 +90,6 @@ static void __init handle_initrd(void)
 		return;
 	call_usermodehelper_exec(info, UMH_WAIT_PROC);
 
-	current->flags &= ~PF_FREEZER_SKIP;
-
 	/* move initrd to rootfs' /old */
 	init_mount("..", ".", NULL, MS_MOVE, NULL);
 	/* switch root and cwd back to / of rootfs */
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..1a628f36c42a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -468,10 +468,10 @@ static void exit_mm(void)
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
-			freezable_schedule();
+			schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 		mmap_read_lock(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index e595e77913eb..cd332f6816c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1272,11 +1272,9 @@ static int wait_for_vfork_done(struct task_struct *child,
 {
 	int killed;
 
-	freezer_do_not_count();
 	cgroup_enter_frozen();
-	killed = wait_for_completion_killable(vfork);
+	killed = wait_for_completion_killable_freezable(vfork);
 	cgroup_leave_frozen(false);
-	freezer_count();
 
 	if (killed) {
 		task_lock(child);
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..c0eccb4b2aab 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,8 +13,8 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
 /* indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
@@ -29,7 +29,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -52,41 +52,56 @@ bool freezing_slow_path(struct task_struct *p)
 }
 EXPORT_SYMBOL(freezing_slow_path);
 
+/* Recursion relies on tail-call optimization to not blow away the stack */
+bool frozen(struct task_struct *p)
+{
+	lockdep_assert_held(&tasklist_lock);
+
+	if (p->state == TASK_FROZEN)
+		return true;
+
+	/*
+	 * If stuck in TRACED, and the ptracer is FROZEN, we're frozen too.
+	 */
+	if (task_is_traced(p))
+		return frozen(p->parent);
+
+	/*
+	 * If stuck in STOPPED and the parent is FROZEN, we're frozen too.
+	 */
+	if (task_is_stopped(p))
+		return frozen(p->real_parent);
+
+	return false;
+}
+
 /* Refrigerator is place where frozen processes are stored :-). */
 bool __refrigerator(bool check_kthr_stop)
 {
-	/* Hmm, should we be allowed to suspend when there are realtime
-	   processes around? */
+	unsigned int state = READ_ONCE(current->state);
 	bool was_frozen = false;
-	long save = current->state;
 
 	pr_debug("%s entered refrigerator\n", current->comm);
 
+	WARN_ON_ONCE(state && !(state & TASK_NORMAL));
+
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		bool freeze;
 
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
+		if (!freeze)
 			break;
+
 		was_frozen = true;
+		set_current_state(TASK_FROZEN);
 		schedule();
 	}
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-	/*
-	 * Restore saved task state before returning.  The mb'd version
-	 * needs to be used; otherwise, it might silently break
-	 * synchronization which depends on ordered task state change.
-	 */
-	set_current_state(save);
-
 	return was_frozen;
 }
 EXPORT_SYMBOL(__refrigerator);
@@ -101,6 +116,37 @@ static void fake_signal_wake_up(struct task_struct *p)
 	}
 }
 
+static bool __freeze_task(struct task_struct *p)
+{
+	unsigned long flags;
+	unsigned int state;
+	bool frozen = false;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	state = READ_ONCE(p->state);
+	if (state & TASK_FREEZABLE) {
+		/*
+		 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE,
+		 * since they can suffer spurious wakeups.
+		 */
+		WARN_ON_ONCE(!(state & TASK_NORMAL));
+
+#ifdef CONFIG_LOCKDEP
+		/*
+		 * It's dangerous to freeze with locks held; there be dragons there.
+		 */
+		if (!(state & __TASK_FREEZABLE_UNSAFE))
+			WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+#endif
+
+		p->state = TASK_FROZEN;
+		frozen = true;
+	}
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return frozen;
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +162,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -148,8 +182,8 @@ void __thaw_task(struct task_struct *p)
 	unsigned long flags;
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
-		wake_up_process(p);
+	WARN_ON_ONCE(freezing(p));
+	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
diff --git a/kernel/futex.c b/kernel/futex.c
index 08008c225bec..5c80ca433fd3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2582,7 +2582,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	 * queue_me() calls spin_unlock() upon completion, both serializing
 	 * access to the hash list and forcing another memory barrier.
 	 */
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	queue_me(q, hb);
 
 	/* Arm the timer */
@@ -2600,7 +2600,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		 * is no timeout, or if it has yet to expire.
 		 */
 		if (!timeout || timeout->task)
-			freezable_schedule();
+			schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 }
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..33a4f2635970 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(t->state & (TASK_FREEZABLE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 12c7e1bb442f..0817e3913294 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -23,7 +23,8 @@
 
 void lock_system_sleep(void)
 {
-	current->flags |= PF_FREEZER_SKIP;
+	WARN_ON_ONCE(current->flags & PF_NOFREEZE);
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(lock_system_sleep);
@@ -46,7 +47,7 @@ void unlock_system_sleep(void)
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	current->flags &= ~PF_FREEZER_SKIP;
+	current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 50cc63534486..36dee2dcfeab 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user_only)
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user_only)
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -132,7 +130,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(true);
 	pr_info("Freezing user space processes ... ");
@@ -193,7 +191,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index a778554f9dad..b3d2b9ae72e6 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -139,6 +139,12 @@ void __sched wait_for_completion(struct completion *x)
 }
 EXPORT_SYMBOL(wait_for_completion);
 
+void __sched wait_for_completion_freezable(struct completion *x)
+{
+	wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+}
+EXPORT_SYMBOL(wait_for_completion_freezable);
+
 /**
  * wait_for_completion_timeout: - waits for completion of a task (w/timeout)
  * @x:  holds the state of this particular completion
@@ -247,6 +253,16 @@ int __sched wait_for_completion_killable(struct completion *x)
 }
 EXPORT_SYMBOL(wait_for_completion_killable);
 
+int __sched wait_for_completion_killable_freezable(struct completion *x)
+{
+	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT,
+				 TASK_KILLABLE|TASK_FREEZABLE);
+	if (t == -ERESTARTSYS)
+		return t;
+	return 0;
+}
+EXPORT_SYMBOL(wait_for_completion_killable_freezable);
+
 /**
  * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
  * @x:  holds the state of this particular completion
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index adea0b1e8036..8ee0e7ccd632 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5893,7 +5893,7 @@ static void __sched notrace __schedule(bool preempt)
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..809ccad87565 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2265,7 +2265,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
 		preempt_enable_no_resched();
-		freezable_schedule();
+		schedule();
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2445,7 +2445,7 @@ static bool do_signal_stop(int signr)
 
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
-		freezable_schedule();
+		schedule();
 		return true;
 	} else {
 		/*
@@ -2521,11 +2521,11 @@ static void do_freezer_trap(void)
 	 * immediately (if there is a non-fatal signal pending), and
 	 * put the task into sleep.
 	 */
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	clear_thread_flag(TIF_SIGPENDING);
 	spin_unlock_irq(&current->sighand->siglock);
 	cgroup_enter_frozen();
-	freezable_schedule();
+	schedule();
 }
 
 static int ptrace_signal(int signr, kernel_siginfo_t *info)
@@ -3569,9 +3569,9 @@ static int do_sigtimedwait(const sigset_t *which, kernel_siginfo_t *info,
 		recalc_sigpending();
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		__set_current_state(TASK_INTERRUPTIBLE);
-		ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
-							 HRTIMER_MODE_REL);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
+					       HRTIMER_MODE_REL);
 		spin_lock_irq(&tsk->sighand->siglock);
 		__set_task_blocked(tsk, &tsk->real_blocked);
 		sigemptyset(&tsk->real_blocked);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4a66725b1d4a..a0e1f466254c 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1889,11 +1889,11 @@ static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mod
 	struct restart_block *restart;
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		hrtimer_sleeper_start_expires(t, mode);
 
 		if (likely(t->task))
-			freezable_schedule();
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6c0185fdd815..9806779092a9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -794,8 +794,8 @@ static void khugepaged_alloc_sleep(void)
 	DEFINE_WAIT(wait);
 
 	add_wait_queue(&khugepaged_wait, &wait);
-	freezable_schedule_timeout_interruptible(
-		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 39ed0e0afe6d..914f30bd23aa 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -268,7 +268,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -324,14 +324,12 @@ static int rpc_complete_task(struct rpc_task *task)
  * to enforce taking of the wq->lock and hence avoid races with
  * rpc_complete_task().
  */
-int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
+int rpc_wait_for_completion_task(struct rpc_task *task)
 {
-	if (action == NULL)
-		action = rpc_wait_bit_killable;
 	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
-			action, TASK_KILLABLE);
+			rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
-EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
+EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
 
 /*
  * Make an RPC task runnable.
@@ -928,7 +926,7 @@ static void __rpc_execute(struct rpc_task *task)
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5a31307ceb76..e5b83598b695 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2190,13 +2190,14 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 				  struct sk_buff *last, unsigned int last_len,
 				  bool freezable)
 {
+	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, state);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2209,10 +2210,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		if (freezable)
-			timeo = freezable_schedule_timeout(timeo);
-		else
-			timeo = schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-01  8:21   ` [RFC][PATCH] freezer,sched: Rewrite core freezer logic Peter Zijlstra
@ 2021-06-01 11:27     ` Peter Zijlstra
  2021-06-02 12:54       ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-06-01 11:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

On Tue, Jun 01, 2021 at 10:21:15AM +0200, Peter Zijlstra wrote:
> 
> Hi,
> 
> This here rewrites the core freezer to behave better wrt thawing. By
> replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
> ensured frozen tasks stay frozen until woken and don't randomly wake up
> early, as is currently possible.
> 
> As such, it does away with PF_FROZEN and PF_FREEZER_SKIP (yay).
> 
> It does however completely wreck kernel/cgroup/legacy_freezer.c and I've
> not yet spend any time on trying to figure out that code, will do so
> shortly.
> 
> Other than that, the freezer seems to work fine, I've tested it with:
> 
>   echo freezer > /sys/power/pm_test
>   echo mem > /sys/power/state
> 
> Even while having a GDB session active, and that all works.
> 
> Another notable bit is in init/do_mounts_initrd.c; afaict that has been
> 'broken' for quite a while and is simply removed.
> 
> Please have a look.
> 
> Somewhat-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

cgroup crud now compiles, also fixed some allmodconfig fails.

---
 drivers/android/binder.c       |   4 +-
 drivers/media/pci/pt3/pt3.c    |   4 +-
 fs/cifs/inode.c                |   4 +-
 fs/cifs/transport.c            |   5 +-
 fs/coredump.c                  |   4 +-
 fs/nfs/file.c                  |   3 +-
 fs/nfs/inode.c                 |  12 +-
 fs/nfs/nfs3proc.c              |   3 +-
 fs/nfs/nfs4proc.c              |  14 +--
 fs/nfs/nfs4state.c             |   3 +-
 fs/nfs/pnfs.c                  |   4 +-
 fs/xfs/xfs_trans_ail.c         |   8 +-
 include/linux/completion.h     |   2 +
 include/linux/freezer.h        | 244 ++---------------------------------------
 include/linux/sched.h          |   9 +-
 include/linux/sunrpc/sched.h   |   7 +-
 include/linux/wait.h           |  40 ++++++-
 init/do_mounts_initrd.c        |   7 +-
 kernel/cgroup/legacy_freezer.c |  23 ++--
 kernel/exit.c                  |   4 +-
 kernel/fork.c                  |   4 +-
 kernel/freezer.c               | 115 +++++++++++++------
 kernel/futex.c                 |   4 +-
 kernel/hung_task.c             |   4 +-
 kernel/power/main.c            |   5 +-
 kernel/power/process.c         |  10 +-
 kernel/sched/completion.c      |  16 +++
 kernel/sched/core.c            |   2 +-
 kernel/signal.c                |  14 +--
 kernel/time/hrtimer.c          |   4 +-
 mm/khugepaged.c                |   4 +-
 net/sunrpc/sched.c             |  12 +-
 net/unix/af_unix.c             |   8 +-
 33 files changed, 225 insertions(+), 381 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index bcec598b89f2..70b1dbf2c2b1 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -3712,10 +3712,9 @@ static int binder_wait_for_work(struct binder_thread *thread,
 	struct binder_proc *proc = thread->proc;
 	int ret = 0;
 
-	freezer_do_not_count();
 	binder_inner_proc_lock(proc);
 	for (;;) {
-		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		if (binder_has_work_ilocked(thread, do_proc_work))
 			break;
 		if (do_proc_work)
@@ -3732,7 +3731,6 @@ static int binder_wait_for_work(struct binder_thread *thread,
 	}
 	finish_wait(&thread->wait, &wait);
 	binder_inner_proc_unlock(proc);
-	freezer_count();
 
 	return ret;
 }
diff --git a/drivers/media/pci/pt3/pt3.c b/drivers/media/pci/pt3/pt3.c
index c0bc86793355..cc3cfbc0e33c 100644
--- a/drivers/media/pci/pt3/pt3.c
+++ b/drivers/media/pci/pt3/pt3.c
@@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data)
 		pt3_proc_dma(adap);
 
 		delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC);
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		freezable_schedule_hrtimeout_range(&delay,
+		set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_hrtimeout_range(&delay,
 					PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC,
 					HRTIMER_MODE_REL);
 	}
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 1dfa57982522..8fccf35392ec 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2280,7 +2280,7 @@ cifs_invalidate_mapping(struct inode *inode)
 static int
 cifs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -2297,7 +2297,7 @@ cifs_revalidate_mapping(struct inode *inode)
 		return 0;
 
 	rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable,
-				     TASK_KILLABLE);
+				     TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (rc)
 		return rc;
 
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index c1725b55f364..9258101bafbb 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -771,8 +771,9 @@ wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ)
 {
 	int error;
 
-	error = wait_event_freezekillable_unsafe(server->response_q,
-				    midQ->mid_state != MID_REQUEST_SUBMITTED);
+	error = wait_event_state(server->response_q,
+				 midQ->mid_state != MID_REQUEST_SUBMITTED,
+				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
 	if (error < 0)
 		return -ERESTARTSYS;
 
diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..50d5db61e6e0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -465,9 +465,7 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
 
-		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
-		freezer_count();
+		wait_for_completion_freezable(&core_state->startup);
 		/*
 		 * Wait for all the threads to become inactive, so that
 		 * all the thread context (extended register state, like
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1fef107961bc..41e4bb19023d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -558,7 +558,8 @@ static vm_fault_t nfs_vm_page_mkwrite(struct vm_fault *vmf)
 	nfs_fscache_wait_on_page_write(NFS_I(inode), page);
 
 	wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING,
-			nfs_wait_bit_killable, TASK_KILLABLE);
+			   nfs_wait_bit_killable,
+			   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 
 	lock_page(page);
 	mapping = page_file_mapping(page);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 529c4099f482..df97984e9e5d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
 	return nfs_fileid_to_ino_t(fattr->fileid);
 }
 
-static int nfs_wait_killable(int mode)
+int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
 }
-
-int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
-{
-	return nfs_wait_killable(mode);
-}
 EXPORT_SYMBOL_GPL(nfs_wait_bit_killable);
 
 /**
@@ -1323,7 +1318,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
 	 */
 	for (;;) {
 		ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING,
-					 nfs_wait_bit_killable, TASK_KILLABLE);
+					 nfs_wait_bit_killable,
+					 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (ret)
 			goto out;
 		spin_lock(&inode->i_lock);
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 5c4e23abc345..85043fc451e9 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags)
 		res = rpc_call_sync(clnt, msg, flags);
 		if (res != -EJUKEBOX)
 			break;
-		freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+		__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+		schedule_timeout(NFS_JUKEBOX_RETRY_TIME);
 		res = -ERESTARTSYS;
 	} while (!fatal_signal_pending(current));
 	return res;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 87d04f2c9385..925c8bc217b5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -411,8 +411,8 @@ static int nfs4_delay_killable(long *timeout)
 {
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
+	__set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!__fatal_signal_pending(current))
 		return 0;
 	return -EINTR;
@@ -422,7 +422,8 @@ static int nfs4_delay_interruptible(long *timeout)
 {
 	might_sleep();
 
-	freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE);
+	schedule_timeout(nfs4_update_delay(timeout));
 	if (!signal_pending(current))
 		return 0;
 	return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS;
@@ -7280,7 +7281,8 @@ nfs4_retry_setlk_simple(struct nfs4_state *state, int cmd,
 		status = nfs4_proc_setlk(state, cmd, request);
 		if ((status != -EAGAIN) || IS_SETLK(cmd))
 			break;
-		freezable_schedule_timeout_interruptible(timeout);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		schedule_timeout(timeout);
 		timeout *= 2;
 		timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout);
 		status = -ERESTARTSYS;
@@ -7348,10 +7350,8 @@ nfs4_retry_setlk(struct nfs4_state *state, int cmd, struct file_lock *request)
 			break;
 
 		status = -ERESTARTSYS;
-		freezer_do_not_count();
-		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE,
+		wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE,
 			   NFS4_LOCK_MAXTIMEOUT);
-		freezer_count();
 	} while (!signalled());
 
 	remove_wait_queue(q, &waiter.wait);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f22818a80c2c..37f1aea98f8f 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1307,7 +1307,8 @@ int nfs4_wait_clnt_recover(struct nfs_client *clp)
 
 	refcount_inc(&clp->cl_count);
 	res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING,
-				 nfs_wait_bit_killable, TASK_KILLABLE);
+				 nfs_wait_bit_killable,
+				 TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 	if (res)
 		goto out;
 	if (clp->cl_cons_state < 0)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 03e0b34c4a64..807d8d56a09a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1898,7 +1898,7 @@ static int pnfs_prepare_to_retry_layoutget(struct pnfs_layout_hdr *lo)
 	pnfs_layoutcommit_inode(lo->plh_inode, false);
 	return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN,
 				   nfs_wait_bit_killable,
-				   TASK_KILLABLE);
+				   TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
 
 static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo)
@@ -3173,7 +3173,7 @@ pnfs_layoutcommit_inode(struct inode *inode, bool sync)
 		status = wait_on_bit_lock_action(&nfsi->flags,
 				NFS_INO_LAYOUTCOMMITTING,
 				nfs_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 		if (status)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..bbf135df69c5 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -585,9 +585,9 @@ xfsaild(
 
 	while (1) {
 		if (tout && tout <= 20)
-			set_current_state(TASK_KILLABLE);
+			set_current_state(TASK_KILLABLE|TASK_FREEZABLE);
 		else
-			set_current_state(TASK_INTERRUPTIBLE);
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 
 		/*
 		 * Check kthread_should_stop() after we set the task state to
@@ -636,14 +636,14 @@ xfsaild(
 		    ailp->ail_target == ailp->ail_target_prev &&
 		    list_empty(&ailp->ail_buf_list)) {
 			spin_unlock(&ailp->ail_lock);
-			freezable_schedule();
+			schedule();
 			tout = 0;
 			continue;
 		}
 		spin_unlock(&ailp->ail_lock);
 
 		if (tout)
-			freezable_schedule_timeout(msecs_to_jiffies(tout));
+			schedule_timeout(msecs_to_jiffies(tout));
 
 		__set_current_state(TASK_RUNNING);
 
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 51d9ab079629..76f8b5dac648 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -100,9 +100,11 @@ static inline void reinit_completion(struct completion *x)
 }
 
 extern void wait_for_completion(struct completion *);
+extern void wait_for_completion_freezable(struct completion *);
 extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
+extern int wait_for_completion_killable_freezable(struct completion *x);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
 						   unsigned long timeout);
 extern unsigned long wait_for_completion_io_timeout(struct completion *x,
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 0621c5f86c39..c1cece97fd82 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -8,9 +8,11 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <linux/jump_label.h>
 
 #ifdef CONFIG_FREEZER
-extern atomic_t system_freezing_cnt;	/* nr of freezing conds in effect */
+DECLARE_STATIC_KEY_FALSE(freezer_active);
+
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
@@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs;
 /*
  * Check if a process has been frozen
  */
-static inline bool frozen(struct task_struct *p)
-{
-	return p->flags & PF_FROZEN;
-}
+extern bool frozen(struct task_struct *p);
 
 extern bool freezing_slow_path(struct task_struct *p);
 
@@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct task_struct *p);
  */
 static inline bool freezing(struct task_struct *p)
 {
-	if (likely(!atomic_read(&system_freezing_cnt)))
-		return false;
-	return freezing_slow_path(p);
+	if (static_branch_unlikely(&freezer_active))
+		return freezing_slow_path(p);
+
+	return false;
 }
 
 /* Takes and releases task alloc lock using task_lock() */
@@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void);
 extern void thaw_processes(void);
 extern void thaw_kernel_threads(void);
 
-/*
- * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
- * If try_to_freeze causes a lockdep warning it means the caller may deadlock
- */
-static inline bool try_to_freeze_unsafe(void)
+static inline bool try_to_freeze(void)
 {
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
-	return __refrigerator(false);
-}
-
-static inline bool try_to_freeze(void)
-{
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
-	return try_to_freeze_unsafe();
+	return __refrigerator(false);
 }
 
 extern bool freeze_task(struct task_struct *p);
@@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struct task_struct *task)
 }
 #endif /* !CONFIG_CGROUP_FREEZER */
 
-/*
- * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
- * calls wait_for_completion(&vfork) and reset right after it returns from this
- * function.  Next, the parent should call try_to_freeze() to freeze itself
- * appropriately in case the child has exited before the freezing of tasks is
- * complete.  However, we don't want kernel threads to be frozen in unexpected
- * places, so we allow them to block freeze_processes() instead or to set
- * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the
- * parent won't really block freeze_processes(), since ____call_usermodehelper()
- * (the child) does a little before exec/exit and it can't be frozen before
- * waking up the parent.
- */
-
-
-/**
- * freezer_do_not_count - tell freezer to ignore %current
- *
- * Tell freezers to ignore the current task when determining whether the
- * target frozen state is reached.  IOW, the current task will be
- * considered frozen enough by freezers.
- *
- * The caller shouldn't do anything which isn't allowed for a frozen task
- * until freezer_cont() is called.  Usually, freezer[_do_not]_count() pair
- * wrap a scheduling operation and nothing much else.
- */
-static inline void freezer_do_not_count(void)
-{
-	current->flags |= PF_FREEZER_SKIP;
-}
-
-/**
- * freezer_count - tell freezer to stop ignoring %current
- *
- * Undo freezer_do_not_count().  It tells freezers that %current should be
- * considered again and tries to freeze if freezing condition is already in
- * effect.
- */
-static inline void freezer_count(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	/*
-	 * If freezing is in progress, the following paired with smp_mb()
-	 * in freezer_should_skip() ensures that either we see %true
-	 * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	try_to_freeze();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezer_count_unsafe(void)
-{
-	current->flags &= ~PF_FREEZER_SKIP;
-	smp_mb();
-	try_to_freeze_unsafe();
-}
-
-/**
- * freezer_should_skip - whether to skip a task when determining frozen
- *			 state is reached
- * @p: task in quesion
- *
- * This function is used by freezers after establishing %true freezing() to
- * test whether a task should be skipped when determining the target frozen
- * state is reached.  IOW, if this function returns %true, @p is considered
- * frozen enough.
- */
-static inline bool freezer_should_skip(struct task_struct *p)
-{
-	/*
-	 * The following smp_mb() paired with the one in freezer_count()
-	 * ensures that either freezer_count() sees %true freezing() or we
-	 * see cleared %PF_FREEZER_SKIP and return %false.  This makes it
-	 * impossible for a task to slip frozen state testing after
-	 * clearing %PF_FREEZER_SKIP.
-	 */
-	smp_mb();
-	return p->flags & PF_FREEZER_SKIP;
-}
-
-/*
- * These functions are intended to be used whenever you want allow a sleeping
- * task to be frozen. Note that neither return any clear indication of
- * whether a freeze event happened while in this function.
- */
-
-/* Like schedule(), but should not block the freezer. */
-static inline void freezable_schedule(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count();
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline void freezable_schedule_unsafe(void)
-{
-	freezer_do_not_count();
-	schedule();
-	freezer_count_unsafe();
-}
-
-/*
- * Like schedule_timeout(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Like schedule_timeout_interruptible(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline long freezable_schedule_timeout_interruptible(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout)
-{
-	long __retval;
-
-	freezer_do_not_count();
-	__retval = schedule_timeout_interruptible(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/* Like schedule_timeout_killable(), but should not block the freezer. */
-static inline long freezable_schedule_timeout_killable(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count();
-	return __retval;
-}
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-static inline long freezable_schedule_timeout_killable_unsafe(long timeout)
-{
-	long __retval;
-	freezer_do_not_count();
-	__retval = schedule_timeout_killable(timeout);
-	freezer_count_unsafe();
-	return __retval;
-}
-
-/*
- * Like schedule_hrtimeout_range(), but should not block the freezer.  Do not
- * call this with locks held.
- */
-static inline int freezable_schedule_hrtimeout_range(ktime_t *expires,
-		u64 delta, const enum hrtimer_mode mode)
-{
-	int __retval;
-	freezer_do_not_count();
-	__retval = schedule_hrtimeout_range(expires, delta, mode);
-	freezer_count();
-	return __retval;
-}
-
-/*
- * Freezer-friendly wrappers around wait_event_interruptible(),
- * wait_event_killable() and wait_event_interruptible_timeout(), originally
- * defined in <linux/wait.h>
- */
-
-/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-({									\
-	int __retval;							\
-	freezer_do_not_count();						\
-	__retval = wait_event_killable(wq, (condition));		\
-	freezer_count_unsafe();						\
-	__retval;							\
-})
-
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
@@ -281,35 +83,9 @@ static inline void thaw_kernel_threads(void) {}
 
 static inline bool try_to_freeze(void) { return false; }
 
-static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
-static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
 
-#define freezable_schedule()  schedule()
-
-#define freezable_schedule_unsafe()  schedule()
-
-#define freezable_schedule_timeout(timeout)  schedule_timeout(timeout)
-
-#define freezable_schedule_timeout_interruptible(timeout)		\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_interruptible_unsafe(timeout)	\
-	schedule_timeout_interruptible(timeout)
-
-#define freezable_schedule_timeout_killable(timeout)			\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_timeout_killable_unsafe(timeout)		\
-	schedule_timeout_killable(timeout)
-
-#define freezable_schedule_hrtimeout_range(expires, delta, mode)	\
-	schedule_hrtimeout_range(expires, delta, mode)
-
-#define wait_event_freezekillable_unsafe(wq, condition)			\
-		wait_event_killable(wq, condition)
-
 #endif /* !CONFIG_FREEZER */
 
 #endif	/* FREEZER_H_INCLUDED */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2982cfab1ae9..bfadc1dbcf24 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,7 +95,12 @@ struct task_group;
 #define TASK_WAKING			0x0200
 #define TASK_NOLOAD			0x0400
 #define TASK_NEW			0x0800
-#define TASK_STATE_MAX			0x1000
+#define TASK_FREEZABLE			0x1000
+#define __TASK_FREEZABLE_UNSAFE		0x2000
+#define TASK_FROZEN			0x4000
+#define TASK_STATE_MAX			0x8000
+
+#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
 
 /* Convenience macros for the sake of set_current_state: */
 #define TASK_KILLABLE			(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
@@ -1579,7 +1584,6 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
@@ -1591,7 +1595,6 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
-#define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index df696efdd675..4ecc7939997b 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -263,7 +263,7 @@ int		rpc_malloc(struct rpc_task *);
 void		rpc_free(struct rpc_task *);
 int		rpciod_up(void);
 void		rpciod_down(void);
-int		__rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *);
+int		rpc_wait_for_completion_task(struct rpc_task *task);
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 struct net;
 void		rpc_show_tasks(struct net *);
@@ -274,11 +274,6 @@ extern struct workqueue_struct *rpciod_workqueue;
 extern struct workqueue_struct *xprtiod_workqueue;
 void		rpc_prepare_task(struct rpc_task *task);
 
-static inline int rpc_wait_for_completion_task(struct rpc_task *task)
-{
-	return __rpc_wait_for_completion_task(task, NULL);
-}
-
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS)
 static inline const char * rpc_qname(const struct rpc_wait_queue *q)
 {
diff --git a/include/linux/wait.h b/include/linux/wait.h
index fe10e8570a52..f017880c25f0 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -335,8 +335,8 @@ do {										\
 } while (0)
 
 #define __wait_event_freezable(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-			    freezable_schedule())
+	___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE),	\
+			0, 0, schedule())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
@@ -394,8 +394,8 @@ do {										\
 
 #define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
-		      TASK_INTERRUPTIBLE, 0, timeout,				\
-		      __ret = freezable_schedule_timeout(__ret))
+		      (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout,		\
+		      __ret = schedule_timeout(__ret))
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
@@ -615,8 +615,8 @@ do {										\
 
 
 #define __wait_event_freezable_exclusive(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0,			\
-			freezable_schedule())
+	___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\
+			schedule())
 
 #define wait_event_freezable_exclusive(wq, condition)				\
 ({										\
@@ -905,6 +905,34 @@ extern int do_wait_intr_irq(wait_queue_head_t *, wait_queue_entry_t *);
 	__ret;									\
 })
 
+#define __wait_event_state(wq, condition, state)				\
+	___wait_event(wq, condition, state, 0, 0, schedule())
+
+/**
+ * wait_event_state - sleep until a condition gets true
+ * @wq_head: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ * @state: state to sleep in
+ *
+ * The process is put to sleep (@state) until the @condition evaluates to true
+ * or a signal is received.  The @condition is checked each time the waitqueue
+ * @wq_head is woken up.
+ *
+ * wake_up() has to be called after changing any variable that could
+ * change the result of the wait condition.
+ *
+ * The function will return -ERESTARTSYS if it was interrupted by a
+ * signal and 0 if @condition evaluated to true.
+ */
+#define wait_event_state(wq_head, condition, state)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_state(wq_head, condition, state);		\
+	__ret;									\
+})
+
 #define __wait_event_killable_timeout(wq_head, condition, timeout)		\
 	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
 		      TASK_KILLABLE, 0, timeout,				\
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index 533d81ed74d4..2f1227053694 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -80,10 +80,9 @@ static void __init handle_initrd(void)
 	init_chdir("/old");
 
 	/*
-	 * In case that a resume from disk is carried out by linuxrc or one of
-	 * its children, we need to tell the freezer not to wait for us.
+	 * PF_FREEZER_SKIP is pointless because UMH is laundered through a
+	 * workqueue and that doesn't have the current context.
 	 */
-	current->flags |= PF_FREEZER_SKIP;
 
 	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
 					 GFP_KERNEL, init_linuxrc, NULL, NULL);
@@ -91,8 +90,6 @@ static void __init handle_initrd(void)
 		return;
 	call_usermodehelper_exec(info, UMH_WAIT_PROC);
 
-	current->flags &= ~PF_FREEZER_SKIP;
-
 	/* move initrd to rootfs' /old */
 	init_mount("..", ".", NULL, MS_MOVE, NULL);
 	/* switch root and cwd back to / of rootfs */
diff --git a/kernel/cgroup/legacy_freezer.c b/kernel/cgroup/legacy_freezer.c
index 08236798d173..1b6b21851e9d 100644
--- a/kernel/cgroup/legacy_freezer.c
+++ b/kernel/cgroup/legacy_freezer.c
@@ -113,7 +113,7 @@ static int freezer_css_online(struct cgroup_subsys_state *css)
 
 	if (parent && (parent->state & CGROUP_FREEZING)) {
 		freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 	}
 
 	mutex_unlock(&freezer_mutex);
@@ -134,7 +134,7 @@ static void freezer_css_offline(struct cgroup_subsys_state *css)
 	mutex_lock(&freezer_mutex);
 
 	if (freezer->state & CGROUP_FREEZING)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 
 	freezer->state = 0;
 
@@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup_taskset *tset)
 			__thaw_task(task);
 		} else {
 			freeze_task(task);
+
 			/* clear FROZEN and propagate upwards */
 			while (freezer && (freezer->state & CGROUP_FROZEN)) {
 				freezer->state &= ~CGROUP_FROZEN;
@@ -271,16 +272,8 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
-		if (freezing(task)) {
-			/*
-			 * freezer_should_skip() indicates that the task
-			 * should be skipped when determining freezing
-			 * completion.  Consider it frozen in addition to
-			 * the usual frozen condition.
-			 */
-			if (!frozen(task) && !freezer_should_skip(task))
-				goto out_iter_end;
-		}
+		if (freezing(task) && !frozen(task))
+			goto out_iter_end;
 	}
 
 	freezer->state |= CGROUP_FROZEN;
@@ -357,7 +350,7 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
 
 	if (freeze) {
 		if (!(freezer->state & CGROUP_FREEZING))
-			atomic_inc(&system_freezing_cnt);
+			static_branch_inc(&freezer_active);
 		freezer->state |= state;
 		freeze_cgroup(freezer);
 	} else {
@@ -366,9 +359,9 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
 		freezer->state &= ~state;
 
 		if (!(freezer->state & CGROUP_FREEZING)) {
-			if (was_freezing)
-				atomic_dec(&system_freezing_cnt);
 			freezer->state &= ~CGROUP_FROZEN;
+			if (was_freezing)
+				static_branch_dec(&freezer_active);
 			unfreeze_cgroup(freezer);
 		}
 	}
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..1a628f36c42a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -468,10 +468,10 @@ static void exit_mm(void)
 			complete(&core_state->startup);
 
 		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
 			if (!self.task) /* see coredump_finish() */
 				break;
-			freezable_schedule();
+			schedule();
 		}
 		__set_current_state(TASK_RUNNING);
 		mmap_read_lock(mm);
diff --git a/kernel/fork.c b/kernel/fork.c
index e595e77913eb..cd332f6816c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1272,11 +1272,9 @@ static int wait_for_vfork_done(struct task_struct *child,
 {
 	int killed;
 
-	freezer_do_not_count();
 	cgroup_enter_frozen();
-	killed = wait_for_completion_killable(vfork);
+	killed = wait_for_completion_killable_freezable(vfork);
 	cgroup_leave_frozen(false);
-	freezer_count();
 
 	if (killed) {
 		task_lock(child);
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..df235fba6989 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -13,8 +13,8 @@
 #include <linux/kthread.h>
 
 /* total number of freezing conditions in effect */
-atomic_t system_freezing_cnt = ATOMIC_INIT(0);
-EXPORT_SYMBOL(system_freezing_cnt);
+DEFINE_STATIC_KEY_FALSE(freezer_active);
+EXPORT_SYMBOL(freezer_active);
 
 /* indicate whether PM freezing is in effect, protected by
  * system_transition_mutex
@@ -29,7 +29,7 @@ static DEFINE_SPINLOCK(freezer_lock);
  * freezing_slow_path - slow path for testing whether a task needs to be frozen
  * @p: task to be tested
  *
- * This function is called by freezing() if system_freezing_cnt isn't zero
+ * This function is called by freezing() if freezer_active isn't zero
  * and tests whether @p needs to enter and stay in frozen state.  Can be
  * called under any context.  The freezers are responsible for ensuring the
  * target tasks see the updated state.
@@ -52,41 +52,67 @@ bool freezing_slow_path(struct task_struct *p)
 }
 EXPORT_SYMBOL(freezing_slow_path);
 
+/* Recursion relies on tail-call optimization to not blow away the stack */
+static bool __frozen(struct task_struct *p)
+{
+	if (p->state == TASK_FROZEN)
+		return true;
+
+	/*
+	 * If stuck in TRACED, and the ptracer is FROZEN, we're frozen too.
+	 */
+	if (task_is_traced(p))
+		return frozen(rcu_dereference(p->parent));
+
+	/*
+	 * If stuck in STOPPED and the parent is FROZEN, we're frozen too.
+	 */
+	if (task_is_stopped(p))
+		return frozen(rcu_dereference(p->real_parent));
+
+	return false;
+}
+
+bool frozen(struct task_struct *p)
+{
+	bool ret;
+
+	rcu_read_lock();
+	ret = __frozen(p);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /* Refrigerator is place where frozen processes are stored :-). */
 bool __refrigerator(bool check_kthr_stop)
 {
-	/* Hmm, should we be allowed to suspend when there are realtime
-	   processes around? */
+	unsigned int state = READ_ONCE(current->state);
 	bool was_frozen = false;
-	long save = current->state;
 
 	pr_debug("%s entered refrigerator\n", current->comm);
 
+	WARN_ON_ONCE(state && !(state & TASK_NORMAL));
+
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		bool freeze;
+
+		set_current_state(TASK_FROZEN);
 
 		spin_lock_irq(&freezer_lock);
-		current->flags |= PF_FROZEN;
-		if (!freezing(current) ||
-		    (check_kthr_stop && kthread_should_stop()))
-			current->flags &= ~PF_FROZEN;
+		freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop());
 		spin_unlock_irq(&freezer_lock);
 
-		if (!(current->flags & PF_FROZEN))
+		if (!freeze)
 			break;
+
 		was_frozen = true;
 		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-	/*
-	 * Restore saved task state before returning.  The mb'd version
-	 * needs to be used; otherwise, it might silently break
-	 * synchronization which depends on ordered task state change.
-	 */
-	set_current_state(save);
-
 	return was_frozen;
 }
 EXPORT_SYMBOL(__refrigerator);
@@ -101,6 +127,37 @@ static void fake_signal_wake_up(struct task_struct *p)
 	}
 }
 
+static bool __freeze_task(struct task_struct *p)
+{
+	unsigned long flags;
+	unsigned int state;
+	bool frozen = false;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	state = READ_ONCE(p->state);
+	if (state & TASK_FREEZABLE) {
+		/*
+		 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE,
+		 * since they can suffer spurious wakeups.
+		 */
+		WARN_ON_ONCE(!(state & TASK_NORMAL));
+
+#ifdef CONFIG_LOCKDEP
+		/*
+		 * It's dangerous to freeze with locks held; there be dragons there.
+		 */
+		if (!(state & __TASK_FREEZABLE_UNSAFE))
+			WARN_ON_ONCE(debug_locks && p->lockdep_depth);
+#endif
+
+		p->state = TASK_FROZEN;
+		frozen = true;
+	}
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return frozen;
+}
+
 /**
  * freeze_task - send a freeze request to given task
  * @p: task to send the request to
@@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
-	/*
-	 * This check can race with freezer_do_not_count, but worst case that
-	 * will result in an extra wakeup being sent to the task.  It does not
-	 * race with freezer_count(), the barriers in freezer_count() and
-	 * freezer_should_skip() ensure that either freezer_count() sees
-	 * freezing == true in try_to_freeze() and freezes, or
-	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
-	 * normally.
-	 */
-	if (freezer_should_skip(p))
-		return false;
-
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (!freezing(p) || frozen(p)) {
+	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
 		spin_unlock_irqrestore(&freezer_lock, flags);
 		return false;
 	}
@@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
 	if (!(p->flags & PF_KTHREAD))
 		fake_signal_wake_up(p);
 	else
-		wake_up_state(p, TASK_INTERRUPTIBLE);
+		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
 
 	spin_unlock_irqrestore(&freezer_lock, flags);
 	return true;
@@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
 	unsigned long flags;
 
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p))
-		wake_up_process(p);
+	WARN_ON_ONCE(freezing(p));
+	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
diff --git a/kernel/futex.c b/kernel/futex.c
index 08008c225bec..5c80ca433fd3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2582,7 +2582,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	 * queue_me() calls spin_unlock() upon completion, both serializing
 	 * access to the hash list and forcing another memory barrier.
 	 */
-	set_current_state(TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	queue_me(q, hb);
 
 	/* Arm the timer */
@@ -2600,7 +2600,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		 * is no timeout, or if it has yet to expire.
 		 */
 		if (!timeout || timeout->task)
-			freezable_schedule();
+			schedule();
 	}
 	__set_current_state(TASK_RUNNING);
 }
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 396ebaebea3f..33a4f2635970 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -92,8 +92,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 * Ensure the task is not frozen.
 	 * Also, skip vfork and any other user process that freezer should skip.
 	 */
-	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
-	    return;
+	if (unlikely(t->state & (TASK_FREEZABLE | TASK_FROZEN)))
+		return;
 
 	/*
 	 * When a freshly created task is scheduled once, changes its state to
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 12c7e1bb442f..0817e3913294 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -23,7 +23,8 @@
 
 void lock_system_sleep(void)
 {
-	current->flags |= PF_FREEZER_SKIP;
+	WARN_ON_ONCE(current->flags & PF_NOFREEZE);
+	current->flags |= PF_NOFREEZE;
 	mutex_lock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(lock_system_sleep);
@@ -46,7 +47,7 @@ void unlock_system_sleep(void)
 	 * Which means, if we use try_to_freeze() here, it would make them
 	 * enter the refrigerator, thus causing hibernation to lockup.
 	 */
-	current->flags &= ~PF_FREEZER_SKIP;
+	current->flags &= ~PF_NOFREEZE;
 	mutex_unlock(&system_transition_mutex);
 }
 EXPORT_SYMBOL_GPL(unlock_system_sleep);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 50cc63534486..36dee2dcfeab 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user_only)
 			if (p == current || !freeze_task(p))
 				continue;
 
-			if (!freezer_should_skip(p))
-				todo++;
+			todo++;
 		}
 		read_unlock(&tasklist_lock);
 
@@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user_only)
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && !freezer_should_skip(p)
-				    && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p))
 					sched_show_task(p);
 			}
 			read_unlock(&tasklist_lock);
@@ -132,7 +130,7 @@ int freeze_processes(void)
 	current->flags |= PF_SUSPEND_TASK;
 
 	if (!pm_freezing)
-		atomic_inc(&system_freezing_cnt);
+		static_branch_inc(&freezer_active);
 
 	pm_wakeup_clear(true);
 	pr_info("Freezing user space processes ... ");
@@ -193,7 +191,7 @@ void thaw_processes(void)
 
 	trace_suspend_resume(TPS("thaw_processes"), 0, true);
 	if (pm_freezing)
-		atomic_dec(&system_freezing_cnt);
+		static_branch_dec(&freezer_active);
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index a778554f9dad..b3d2b9ae72e6 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -139,6 +139,12 @@ void __sched wait_for_completion(struct completion *x)
 }
 EXPORT_SYMBOL(wait_for_completion);
 
+void __sched wait_for_completion_freezable(struct completion *x)
+{
+	wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE|TASK_FREEZABLE);
+}
+EXPORT_SYMBOL(wait_for_completion_freezable);
+
 /**
  * wait_for_completion_timeout: - waits for completion of a task (w/timeout)
  * @x:  holds the state of this particular completion
@@ -247,6 +253,16 @@ int __sched wait_for_completion_killable(struct completion *x)
 }
 EXPORT_SYMBOL(wait_for_completion_killable);
 
+int __sched wait_for_completion_killable_freezable(struct completion *x)
+{
+	long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT,
+				 TASK_KILLABLE|TASK_FREEZABLE);
+	if (t == -ERESTARTSYS)
+		return t;
+	return 0;
+}
+EXPORT_SYMBOL(wait_for_completion_killable_freezable);
+
 /**
  * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable))
  * @x:  holds the state of this particular completion
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index adea0b1e8036..8ee0e7ccd632 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5893,7 +5893,7 @@ static void __sched notrace __schedule(bool preempt)
 			prev->sched_contributes_to_load =
 				(prev_state & TASK_UNINTERRUPTIBLE) &&
 				!(prev_state & TASK_NOLOAD) &&
-				!(prev->flags & PF_FROZEN);
+				!(prev_state & TASK_FROZEN);
 
 			if (prev->sched_contributes_to_load)
 				rq->nr_uninterruptible++;
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..809ccad87565 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2265,7 +2265,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
 		preempt_enable_no_resched();
-		freezable_schedule();
+		schedule();
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2445,7 +2445,7 @@ static bool do_signal_stop(int signr)
 
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
-		freezable_schedule();
+		schedule();
 		return true;
 	} else {
 		/*
@@ -2521,11 +2521,11 @@ static void do_freezer_trap(void)
 	 * immediately (if there is a non-fatal signal pending), and
 	 * put the task into sleep.
 	 */
-	__set_current_state(TASK_INTERRUPTIBLE);
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 	clear_thread_flag(TIF_SIGPENDING);
 	spin_unlock_irq(&current->sighand->siglock);
 	cgroup_enter_frozen();
-	freezable_schedule();
+	schedule();
 }
 
 static int ptrace_signal(int signr, kernel_siginfo_t *info)
@@ -3569,9 +3569,9 @@ static int do_sigtimedwait(const sigset_t *which, kernel_siginfo_t *info,
 		recalc_sigpending();
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		__set_current_state(TASK_INTERRUPTIBLE);
-		ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns,
-							 HRTIMER_MODE_REL);
+		__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns,
+					       HRTIMER_MODE_REL);
 		spin_lock_irq(&tsk->sighand->siglock);
 		__set_task_blocked(tsk, &tsk->real_blocked);
 		sigemptyset(&tsk->real_blocked);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4a66725b1d4a..a0e1f466254c 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1889,11 +1889,11 @@ static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mod
 	struct restart_block *restart;
 
 	do {
-		set_current_state(TASK_INTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
 		hrtimer_sleeper_start_expires(t, mode);
 
 		if (likely(t->task))
-			freezable_schedule();
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6c0185fdd815..9806779092a9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -794,8 +794,8 @@ static void khugepaged_alloc_sleep(void)
 	DEFINE_WAIT(wait);
 
 	add_wait_queue(&khugepaged_wait, &wait);
-	freezable_schedule_timeout_interruptible(
-		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
+	__set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 39ed0e0afe6d..914f30bd23aa 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -268,7 +268,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
+	schedule();
 	if (signal_pending_state(mode, current))
 		return -ERESTARTSYS;
 	return 0;
@@ -324,14 +324,12 @@ static int rpc_complete_task(struct rpc_task *task)
  * to enforce taking of the wq->lock and hence avoid races with
  * rpc_complete_task().
  */
-int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action)
+int rpc_wait_for_completion_task(struct rpc_task *task)
 {
-	if (action == NULL)
-		action = rpc_wait_bit_killable;
 	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
-			action, TASK_KILLABLE);
+			rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE);
 }
-EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
+EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task);
 
 /*
  * Make an RPC task runnable.
@@ -928,7 +926,7 @@ static void __rpc_execute(struct rpc_task *task)
 		trace_rpc_task_sync_sleep(task, task->tk_action);
 		status = out_of_line_wait_on_bit(&task->tk_runstate,
 				RPC_TASK_QUEUED, rpc_wait_bit_killable,
-				TASK_KILLABLE);
+				TASK_KILLABLE|TASK_FREEZABLE);
 		if (status < 0) {
 			/*
 			 * When a sync task receives a signal, it exits with
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5a31307ceb76..e5b83598b695 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2190,13 +2190,14 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 				  struct sk_buff *last, unsigned int last_len,
 				  bool freezable)
 {
+	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, state);
 
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		if (tail != last ||
@@ -2209,10 +2210,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		if (freezable)
-			timeo = freezable_schedule_timeout(timeo);
-		else
-			timeo = schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-01 11:27     ` Peter Zijlstra
@ 2021-06-02 12:54       ` Will Deacon
  2021-06-03 10:35         ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-06-02 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team

Hi Peter,

On Tue, Jun 01, 2021 at 01:27:59PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 01, 2021 at 10:21:15AM +0200, Peter Zijlstra wrote:
> > 
> > Hi,
> > 
> > This here rewrites the core freezer to behave better wrt thawing. By
> > replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
> > ensured frozen tasks stay frozen until woken and don't randomly wake up
> > early, as is currently possible.
> > 
> > As such, it does away with PF_FROZEN and PF_FREEZER_SKIP (yay).
> > 
> > It does however completely wreck kernel/cgroup/legacy_freezer.c and I've
> > not yet spend any time on trying to figure out that code, will do so
> > shortly.
> > 
> > Other than that, the freezer seems to work fine, I've tested it with:
> > 
> >   echo freezer > /sys/power/pm_test
> >   echo mem > /sys/power/state
> > 
> > Even while having a GDB session active, and that all works.
> > 
> > Another notable bit is in init/do_mounts_initrd.c; afaict that has been
> > 'broken' for quite a while and is simply removed.
> > 
> > Please have a look.
> > 
> > Somewhat-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> cgroup crud now compiles, also fixed some allmodconfig fails.

There's a lot here, but generally I really like the look of it, especially
making the "freezable" waits explicit. I've left a few comments below.

>  drivers/android/binder.c       |   4 +-
>  drivers/media/pci/pt3/pt3.c    |   4 +-
>  fs/cifs/inode.c                |   4 +-
>  fs/cifs/transport.c            |   5 +-
>  fs/coredump.c                  |   4 +-
>  fs/nfs/file.c                  |   3 +-
>  fs/nfs/inode.c                 |  12 +-
>  fs/nfs/nfs3proc.c              |   3 +-
>  fs/nfs/nfs4proc.c              |  14 +--
>  fs/nfs/nfs4state.c             |   3 +-
>  fs/nfs/pnfs.c                  |   4 +-
>  fs/xfs/xfs_trans_ail.c         |   8 +-
>  include/linux/completion.h     |   2 +
>  include/linux/freezer.h        | 244 ++---------------------------------------
>  include/linux/sched.h          |   9 +-
>  include/linux/sunrpc/sched.h   |   7 +-
>  include/linux/wait.h           |  40 ++++++-
>  init/do_mounts_initrd.c        |   7 +-
>  kernel/cgroup/legacy_freezer.c |  23 ++--
>  kernel/exit.c                  |   4 +-
>  kernel/fork.c                  |   4 +-
>  kernel/freezer.c               | 115 +++++++++++++------
>  kernel/futex.c                 |   4 +-
>  kernel/hung_task.c             |   4 +-
>  kernel/power/main.c            |   5 +-
>  kernel/power/process.c         |  10 +-
>  kernel/sched/completion.c      |  16 +++
>  kernel/sched/core.c            |   2 +-
>  kernel/signal.c                |  14 +--
>  kernel/time/hrtimer.c          |   4 +-
>  mm/khugepaged.c                |   4 +-
>  net/sunrpc/sched.c             |  12 +-
>  net/unix/af_unix.c             |   8 +-
>  33 files changed, 225 insertions(+), 381 deletions(-)

There's also Documentation/power/freezing-of-tasks.rst to update. I'm not
sure if fs/proc/array.c should be updated to display frozen tasks; I
couldn't see how that was useful, but thought I'd mention it anyway.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2982cfab1ae9..bfadc1dbcf24 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -95,7 +95,12 @@ struct task_group;
>  #define TASK_WAKING			0x0200
>  #define TASK_NOLOAD			0x0400
>  #define TASK_NEW			0x0800
> -#define TASK_STATE_MAX			0x1000
> +#define TASK_FREEZABLE			0x1000
> +#define __TASK_FREEZABLE_UNSAFE		0x2000

Give that this is only needed to avoid lockdep checks, maybe we should avoid
allocating the bit if lockdep is not enabled? Otherwise, people might start
to use it for other things.

> +#define TASK_FROZEN			0x4000
> +#define TASK_STATE_MAX			0x8000
> +
> +#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)

We probably want to preserve the "DO NOT ADD ANY NEW CALLERS OF THIS STATE"
comment for the unsafe stuff.

> diff --git a/kernel/freezer.c b/kernel/freezer.c
> index dc520f01f99d..df235fba6989 100644
> --- a/kernel/freezer.c
> +++ b/kernel/freezer.c
> @@ -13,8 +13,8 @@
>  #include <linux/kthread.h>
>  
>  /* total number of freezing conditions in effect */
> -atomic_t system_freezing_cnt = ATOMIC_INIT(0);
> -EXPORT_SYMBOL(system_freezing_cnt);
> +DEFINE_STATIC_KEY_FALSE(freezer_active);
> +EXPORT_SYMBOL(freezer_active);
>  
>  /* indicate whether PM freezing is in effect, protected by
>   * system_transition_mutex
> @@ -29,7 +29,7 @@ static DEFINE_SPINLOCK(freezer_lock);
>   * freezing_slow_path - slow path for testing whether a task needs to be frozen
>   * @p: task to be tested
>   *
> - * This function is called by freezing() if system_freezing_cnt isn't zero
> + * This function is called by freezing() if freezer_active isn't zero
>   * and tests whether @p needs to enter and stay in frozen state.  Can be
>   * called under any context.  The freezers are responsible for ensuring the
>   * target tasks see the updated state.
> @@ -52,41 +52,67 @@ bool freezing_slow_path(struct task_struct *p)
>  }
>  EXPORT_SYMBOL(freezing_slow_path);
>  
> +/* Recursion relies on tail-call optimization to not blow away the stack */
> +static bool __frozen(struct task_struct *p)
> +{
> +	if (p->state == TASK_FROZEN)
> +		return true;

READ_ONCE()?

> +
> +	/*
> +	 * If stuck in TRACED, and the ptracer is FROZEN, we're frozen too.
> +	 */
> +	if (task_is_traced(p))
> +		return frozen(rcu_dereference(p->parent));
> +
> +	/*
> +	 * If stuck in STOPPED and the parent is FROZEN, we're frozen too.
> +	 */
> +	if (task_is_stopped(p))
> +		return frozen(rcu_dereference(p->real_parent));

This looks convincing, but I really can't tell if we're missing anything.

> +static bool __freeze_task(struct task_struct *p)
> +{
> +	unsigned long flags;
> +	unsigned int state;
> +	bool frozen = false;
> +
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	state = READ_ONCE(p->state);
> +	if (state & TASK_FREEZABLE) {
> +		/*
> +		 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE,
> +		 * since they can suffer spurious wakeups.
> +		 */
> +		WARN_ON_ONCE(!(state & TASK_NORMAL));
> +
> +#ifdef CONFIG_LOCKDEP
> +		/*
> +		 * It's dangerous to freeze with locks held; there be dragons there.
> +		 */
> +		if (!(state & __TASK_FREEZABLE_UNSAFE))
> +			WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> +#endif
> +
> +		p->state = TASK_FROZEN;
> +		frozen = true;
> +	}
> +	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +
> +	return frozen;
> +}
> +
>  /**
>   * freeze_task - send a freeze request to given task
>   * @p: task to send the request to
> @@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
>  {
>  	unsigned long flags;
>  
> -	/*
> -	 * This check can race with freezer_do_not_count, but worst case that
> -	 * will result in an extra wakeup being sent to the task.  It does not
> -	 * race with freezer_count(), the barriers in freezer_count() and
> -	 * freezer_should_skip() ensure that either freezer_count() sees
> -	 * freezing == true in try_to_freeze() and freezes, or
> -	 * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task
> -	 * normally.
> -	 */
> -	if (freezer_should_skip(p))
> -		return false;
> -
>  	spin_lock_irqsave(&freezer_lock, flags);
> -	if (!freezing(p) || frozen(p)) {
> +	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
>  		spin_unlock_irqrestore(&freezer_lock, flags);
>  		return false;
>  	}

I've been trying to figure out how this serialises with ttwu(), given that
frozen(p) will go and read p->state. I suppose it works out because only the
freezer can wake up tasks from the FROZEN state, but it feels a bit brittle.

> @@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
>  	if (!(p->flags & PF_KTHREAD))
>  		fake_signal_wake_up(p);
>  	else
> -		wake_up_state(p, TASK_INTERRUPTIBLE);
> +		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
>  
>  	spin_unlock_irqrestore(&freezer_lock, flags);
>  	return true;
> @@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&freezer_lock, flags);
> -	if (frozen(p))
> -		wake_up_process(p);
> +	WARN_ON_ONCE(freezing(p));
> +	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);

Why do we need TASK_NORMAL here?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-02 12:54       ` Will Deacon
@ 2021-06-03 10:35         ` Peter Zijlstra
  2021-06-03 10:58           ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-06-03 10:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Oleg Nesterov

On Wed, Jun 02, 2021 at 01:54:53PM +0100, Will Deacon wrote:

> There's also Documentation/power/freezing-of-tasks.rst to update. I'm not

Since it's .rst, the only update I'm willing to do is delete it
outright.

> sure if fs/proc/array.c should be updated to display frozen tasks; I
> couldn't see how that was useful, but thought I'd mention it anyway.

Yeah, I considered it too, but I figured that if we're all frozen
there's noone left to observe us being frozen, so I didn't bother.

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2982cfab1ae9..bfadc1dbcf24 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -95,7 +95,12 @@ struct task_group;
> >  #define TASK_WAKING			0x0200
> >  #define TASK_NOLOAD			0x0400
> >  #define TASK_NEW			0x0800
> > -#define TASK_STATE_MAX			0x1000
> > +#define TASK_FREEZABLE			0x1000
> > +#define __TASK_FREEZABLE_UNSAFE		0x2000
> 
> Give that this is only needed to avoid lockdep checks, maybe we should avoid
> allocating the bit if lockdep is not enabled? Otherwise, people might start
> to use it for other things.

Something like

#define __TASK_FREEZABLE_UNSAFE			(0x2000 * IS_ENABLED(CONFIG_LOCKDEP))

?

> > +#define TASK_FROZEN			0x4000
> > +#define TASK_STATE_MAX			0x8000
> > +
> > +#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
> 
> We probably want to preserve the "DO NOT ADD ANY NEW CALLERS OF THIS STATE"
> comment for the unsafe stuff.

Done.

> > +/* Recursion relies on tail-call optimization to not blow away the stack */
> > +static bool __frozen(struct task_struct *p)
> > +{
> > +	if (p->state == TASK_FROZEN)
> > +		return true;
> 
> READ_ONCE()?

task_struct::state is volatile -- for now. I've got other patches to
deal with that.

> > +
> > +	/*
> > +	 * If stuck in TRACED, and the ptracer is FROZEN, we're frozen too.
> > +	 */
> > +	if (task_is_traced(p))
> > +		return frozen(rcu_dereference(p->parent));
> > +
> > +	/*
> > +	 * If stuck in STOPPED and the parent is FROZEN, we're frozen too.
> > +	 */
> > +	if (task_is_stopped(p))
> > +		return frozen(rcu_dereference(p->real_parent));
> 
> This looks convincing, but I really can't tell if we're missing anything.

Yeah, Oleg would be the one to tell us I suppose.

> > +static bool __freeze_task(struct task_struct *p)
> > +{
> > +	unsigned long flags;
> > +	unsigned int state;
> > +	bool frozen = false;
> > +
> > +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> > +	state = READ_ONCE(p->state);
> > +	if (state & TASK_FREEZABLE) {
> > +		/*
> > +		 * Only TASK_NORMAL can be augmented with TASK_FREEZABLE,
> > +		 * since they can suffer spurious wakeups.
> > +		 */
> > +		WARN_ON_ONCE(!(state & TASK_NORMAL));
> > +
> > +#ifdef CONFIG_LOCKDEP
> > +		/*
> > +		 * It's dangerous to freeze with locks held; there be dragons there.
> > +		 */
> > +		if (!(state & __TASK_FREEZABLE_UNSAFE))
> > +			WARN_ON_ONCE(debug_locks && p->lockdep_depth);
> > +#endif
> > +
> > +		p->state = TASK_FROZEN;
> > +		frozen = true;
> > +	}
> > +	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> > +
> > +	return frozen;
> > +}
> > +
> >  /**
> >   * freeze_task - send a freeze request to given task
> >   * @p: task to send the request to
> > @@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
> >  {
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&freezer_lock, flags);
> > +	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
> >  		spin_unlock_irqrestore(&freezer_lock, flags);
> >  		return false;
> >  	}
> 
> I've been trying to figure out how this serialises with ttwu(), given that
> frozen(p) will go and read p->state. I suppose it works out because only the
> freezer can wake up tasks from the FROZEN state, but it feels a bit brittle.

p->pi_lock; both ttwu() and __freeze_task() (which is essentially a
variant of set_special_state()) take ->pi_lock. I'll put in a comment.

> > @@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
> >  	if (!(p->flags & PF_KTHREAD))
> >  		fake_signal_wake_up(p);
> >  	else
> > -		wake_up_state(p, TASK_INTERRUPTIBLE);
> > +		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
> >  
> >  	spin_unlock_irqrestore(&freezer_lock, flags);
> >  	return true;
> > @@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&freezer_lock, flags);
> > -	if (frozen(p))
> > -		wake_up_process(p);
> > +	WARN_ON_ONCE(freezing(p));
> > +	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
> 
> Why do we need TASK_NORMAL here?

It's a left-over from hacking, but I left it in because anything
TASK_NORMAL should be able to deal with spuriuos wakeups, something
try_to_freeze() now also relies on.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-03 10:35         ` Peter Zijlstra
@ 2021-06-03 10:58           ` Will Deacon
  2021-06-03 11:26             ` Peter Zijlstra
  0 siblings, 1 reply; 57+ messages in thread
From: Will Deacon @ 2021-06-03 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Oleg Nesterov

On Thu, Jun 03, 2021 at 12:35:22PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 02, 2021 at 01:54:53PM +0100, Will Deacon wrote:
> 
> > There's also Documentation/power/freezing-of-tasks.rst to update. I'm not
> 
> Since it's .rst, the only update I'm willing to do is delete it
> outright.

Hah! Well, don't do that.

> > sure if fs/proc/array.c should be updated to display frozen tasks; I
> > couldn't see how that was useful, but thought I'd mention it anyway.
> 
> Yeah, I considered it too, but I figured that if we're all frozen
> there's noone left to observe us being frozen, so I didn't bother.

Agreed.

> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 2982cfab1ae9..bfadc1dbcf24 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -95,7 +95,12 @@ struct task_group;
> > >  #define TASK_WAKING			0x0200
> > >  #define TASK_NOLOAD			0x0400
> > >  #define TASK_NEW			0x0800
> > > -#define TASK_STATE_MAX			0x1000
> > > +#define TASK_FREEZABLE			0x1000
> > > +#define __TASK_FREEZABLE_UNSAFE		0x2000
> > 
> > Give that this is only needed to avoid lockdep checks, maybe we should avoid
> > allocating the bit if lockdep is not enabled? Otherwise, people might start
> > to use it for other things.
> 
> Something like
> 
> #define __TASK_FREEZABLE_UNSAFE			(0x2000 * IS_ENABLED(CONFIG_LOCKDEP))
> 
> ?

Yup.

> > > +#define TASK_FROZEN			0x4000
> > > +#define TASK_STATE_MAX			0x8000
> > > +
> > > +#define TASK_FREEZABLE_UNSAFE		(TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE)
> > 
> > We probably want to preserve the "DO NOT ADD ANY NEW CALLERS OF THIS STATE"
> > comment for the unsafe stuff.
> 
> Done.

Thanks.

> > > +/* Recursion relies on tail-call optimization to not blow away the stack */
> > > +static bool __frozen(struct task_struct *p)
> > > +{
> > > +	if (p->state == TASK_FROZEN)
> > > +		return true;
> > 
> > READ_ONCE()?
> 
> task_struct::state is volatile -- for now. I've got other patches to
> deal with that.

Thanks, I missed that and have since reviewed your other series.

> > > @@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
> > >  {
> > >  	unsigned long flags;
> > >  
> > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > +	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
> > >  		spin_unlock_irqrestore(&freezer_lock, flags);
> > >  		return false;
> > >  	}
> > 
> > I've been trying to figure out how this serialises with ttwu(), given that
> > frozen(p) will go and read p->state. I suppose it works out because only the
> > freezer can wake up tasks from the FROZEN state, but it feels a bit brittle.
> 
> p->pi_lock; both ttwu() and __freeze_task() (which is essentially a
> variant of set_special_state()) take ->pi_lock. I'll put in a comment.

The part I struggled with was freeze_task(), which doesn't take ->pi_lock
yet calls frozen(p).

> 
> > > @@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
> > >  	if (!(p->flags & PF_KTHREAD))
> > >  		fake_signal_wake_up(p);
> > >  	else
> > > -		wake_up_state(p, TASK_INTERRUPTIBLE);
> > > +		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
> > >  
> > >  	spin_unlock_irqrestore(&freezer_lock, flags);
> > >  	return true;
> > > @@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
> > >  	unsigned long flags;
> > >  
> > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > -	if (frozen(p))
> > > -		wake_up_process(p);
> > > +	WARN_ON_ONCE(freezing(p));
> > > +	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
> > 
> > Why do we need TASK_NORMAL here?
> 
> It's a left-over from hacking, but I left it in because anything
> TASK_NORMAL should be able to deal with spuriuos wakeups, something
> try_to_freeze() now also relies on.

I just worry that it might hide bugs if TASK_FROZEN is supposed to be
sufficient, as it would imply that we have some unfrozen tasks kicking
around. I dunno, maybe just a comment saying that everything _should_ be
FROZEN at this point?

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-03 10:58           ` Will Deacon
@ 2021-06-03 11:26             ` Peter Zijlstra
  2021-06-03 11:36               ` Will Deacon
  0 siblings, 1 reply; 57+ messages in thread
From: Peter Zijlstra @ 2021-06-03 11:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Oleg Nesterov

On Thu, Jun 03, 2021 at 11:58:56AM +0100, Will Deacon wrote:
> On Thu, Jun 03, 2021 at 12:35:22PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 02, 2021 at 01:54:53PM +0100, Will Deacon wrote:

> > > > @@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
> > > >  {
> > > >  	unsigned long flags;
> > > >  
> > > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > > +	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
> > > >  		spin_unlock_irqrestore(&freezer_lock, flags);
> > > >  		return false;
> > > >  	}
> > > 
> > > I've been trying to figure out how this serialises with ttwu(), given that
> > > frozen(p) will go and read p->state. I suppose it works out because only the
> > > freezer can wake up tasks from the FROZEN state, but it feels a bit brittle.
> > 
> > p->pi_lock; both ttwu() and __freeze_task() (which is essentially a
> > variant of set_special_state()) take ->pi_lock. I'll put in a comment.
> 
> The part I struggled with was freeze_task(), which doesn't take ->pi_lock
> yet calls frozen(p).

Ah, I can't read... I assumed you were asking about __freeze_task().

So frozen(p) checks for p->state == TASK_FROZEN (and complicated), which
is a stable state. Once you're frozen you stay frozen until thaw, which
is after freezing per construction.

The tricky bit is __freeze_task(), that does take pi_lock. It checks for
FREEZABLE and if set, changes to FROZEN. And this does in fact race with
ttwu() and relies on pi_lock to serialize.

A concurrent wakeup (from a non-frozen task) can try and wake the task
we're trying to freeze. If we win, ttwu will see FROZEN and ignore, if
ttwu() wins, we don't see FREEZABLE and do another round of freezing.

> > > > @@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
> > > >  	if (!(p->flags & PF_KTHREAD))
> > > >  		fake_signal_wake_up(p);
> > > >  	else
> > > > -		wake_up_state(p, TASK_INTERRUPTIBLE);
> > > > +		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
> > > >  
> > > >  	spin_unlock_irqrestore(&freezer_lock, flags);
> > > >  	return true;
> > > > @@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
> > > >  	unsigned long flags;
> > > >  
> > > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > > -	if (frozen(p))
> > > > -		wake_up_process(p);
> > > > +	WARN_ON_ONCE(freezing(p));
> > > > +	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
> > > 
> > > Why do we need TASK_NORMAL here?
> > 
> > It's a left-over from hacking, but I left it in because anything
> > TASK_NORMAL should be able to deal with spuriuos wakeups, something
> > try_to_freeze() now also relies on.
> 
> I just worry that it might hide bugs if TASK_FROZEN is supposed to be
> sufficient, as it would imply that we have some unfrozen tasks kicking
> around. I dunno, maybe just a comment saying that everything _should_ be
> FROZEN at this point?

I'll take it out. It really shouldn't matter.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC][PATCH] freezer,sched: Rewrite core freezer logic
  2021-06-03 11:26             ` Peter Zijlstra
@ 2021-06-03 11:36               ` Will Deacon
  0 siblings, 0 replies; 57+ messages in thread
From: Will Deacon @ 2021-06-03 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arm-kernel, linux-arch, linux-kernel, Catalin Marinas,
	Marc Zyngier, Greg Kroah-Hartman, Morten Rasmussen, Qais Yousef,
	Suren Baghdasaryan, Quentin Perret, Tejun Heo, Johannes Weiner,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Rafael J. Wysocki,
	Dietmar Eggemann, Daniel Bristot de Oliveira, kernel-team,
	Oleg Nesterov

On Thu, Jun 03, 2021 at 01:26:06PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 03, 2021 at 11:58:56AM +0100, Will Deacon wrote:
> > On Thu, Jun 03, 2021 at 12:35:22PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 02, 2021 at 01:54:53PM +0100, Will Deacon wrote:
> 
> > > > > @@ -116,20 +173,8 @@ bool freeze_task(struct task_struct *p)
> > > > >  {
> > > > >  	unsigned long flags;
> > > > >  
> > > > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > > > +	if (!freezing(p) || frozen(p) || __freeze_task(p)) {
> > > > >  		spin_unlock_irqrestore(&freezer_lock, flags);
> > > > >  		return false;
> > > > >  	}
> > > > 
> > > > I've been trying to figure out how this serialises with ttwu(), given that
> > > > frozen(p) will go and read p->state. I suppose it works out because only the
> > > > freezer can wake up tasks from the FROZEN state, but it feels a bit brittle.
> > > 
> > > p->pi_lock; both ttwu() and __freeze_task() (which is essentially a
> > > variant of set_special_state()) take ->pi_lock. I'll put in a comment.
> > 
> > The part I struggled with was freeze_task(), which doesn't take ->pi_lock
> > yet calls frozen(p).
> 
> Ah, I can't read... I assumed you were asking about __freeze_task().
> 
> So frozen(p) checks for p->state == TASK_FROZEN (and complicated), which
> is a stable state. Once you're frozen you stay frozen until thaw, which
> is after freezing per construction.
> 
> The tricky bit is __freeze_task(), that does take pi_lock. It checks for
> FREEZABLE and if set, changes to FROZEN. And this does in fact race with
> ttwu() and relies on pi_lock to serialize.
> 
> A concurrent wakeup (from a non-frozen task) can try and wake the task
> we're trying to freeze. If we win, ttwu will see FROZEN and ignore, if
> ttwu() wins, we don't see FREEZABLE and do another round of freezing.

Good, thanks. That's where I'd ended up. It means that nobody else better
try waking up FROZEN tasks!

> > > > > @@ -137,7 +182,7 @@ bool freeze_task(struct task_struct *p)
> > > > >  	if (!(p->flags & PF_KTHREAD))
> > > > >  		fake_signal_wake_up(p);
> > > > >  	else
> > > > > -		wake_up_state(p, TASK_INTERRUPTIBLE);
> > > > > +		wake_up_state(p, TASK_INTERRUPTIBLE); // TASK_NORMAL ?!?
> > > > >  
> > > > >  	spin_unlock_irqrestore(&freezer_lock, flags);
> > > > >  	return true;
> > > > > @@ -148,8 +193,8 @@ void __thaw_task(struct task_struct *p)
> > > > >  	unsigned long flags;
> > > > >  
> > > > >  	spin_lock_irqsave(&freezer_lock, flags);
> > > > > -	if (frozen(p))
> > > > > -		wake_up_process(p);
> > > > > +	WARN_ON_ONCE(freezing(p));
> > > > > +	wake_up_state(p, TASK_FROZEN | TASK_NORMAL);
> > > > 
> > > > Why do we need TASK_NORMAL here?
> > > 
> > > It's a left-over from hacking, but I left it in because anything
> > > TASK_NORMAL should be able to deal with spuriuos wakeups, something
> > > try_to_freeze() now also relies on.
> > 
> > I just worry that it might hide bugs if TASK_FROZEN is supposed to be
> > sufficient, as it would imply that we have some unfrozen tasks kicking
> > around. I dunno, maybe just a comment saying that everything _should_ be
> > FROZEN at this point?
> 
> I'll take it out. It really shouldn't matter.

Perfect.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2021-06-03 11:37 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25 15:14 [PATCH v7 00/22] Add support for 32-bit tasks on asymmetric AArch32 systems Will Deacon
2021-05-25 15:14 ` [PATCH v7 01/22] sched: Favour predetermined active CPU as migration destination Will Deacon
2021-05-26 11:14   ` Valentin Schneider
2021-05-26 12:32     ` Peter Zijlstra
2021-05-26 12:36       ` Valentin Schneider
2021-05-26 16:03     ` Will Deacon
2021-05-26 17:46       ` Valentin Schneider
2021-05-25 15:14 ` [PATCH v7 02/22] arm64: cpuinfo: Split AArch32 registers out into a separate struct Will Deacon
2021-05-25 15:14 ` [PATCH v7 03/22] arm64: Allow mismatched 32-bit EL0 support Will Deacon
2021-05-25 15:14 ` [PATCH v7 04/22] KVM: arm64: Kill 32-bit vCPUs on systems with mismatched " Will Deacon
2021-05-25 15:14 ` [PATCH v7 05/22] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs Will Deacon
2021-05-25 15:14 ` [PATCH v7 06/22] arm64: Advertise CPUs capable of running 32-bit applications in sysfs Will Deacon
2021-05-25 15:14 ` [PATCH v7 07/22] sched: Introduce task_cpu_possible_mask() to limit fallback rq selection Will Deacon
2021-05-25 15:14 ` [PATCH v7 08/22] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 Will Deacon
2021-05-26 15:02   ` Peter Zijlstra
2021-05-26 16:07     ` Will Deacon
2021-05-25 15:14 ` [PATCH v7 09/22] cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() Will Deacon
2021-05-25 15:14 ` [PATCH v7 10/22] sched: Reject CPU affinity changes based on task_cpu_possible_mask() Will Deacon
2021-05-26 15:15   ` Peter Zijlstra
2021-05-26 16:12     ` Will Deacon
2021-05-26 17:56       ` Peter Zijlstra
2021-05-26 18:59         ` Will Deacon
2021-05-25 15:14 ` [PATCH v7 11/22] sched: Introduce task_struct::user_cpus_ptr to track requested affinity Will Deacon
2021-05-25 15:14 ` [PATCH v7 12/22] sched: Split the guts of sched_setaffinity() into a helper function Will Deacon
2021-05-25 15:14 ` [PATCH v7 13/22] sched: Allow task CPU affinity to be restricted on asymmetric systems Will Deacon
2021-05-26 16:20   ` Peter Zijlstra
2021-05-26 16:35     ` Will Deacon
2021-05-26 16:30   ` Peter Zijlstra
2021-05-26 17:02     ` Will Deacon
2021-05-27  7:56       ` Peter Zijlstra
2021-05-25 15:14 ` [PATCH v7 14/22] sched: Introduce task_cpus_dl_admissible() to check proposed affinity Will Deacon
2021-05-25 15:14 ` [PATCH v7 15/22] freezer: Add frozen_or_skipped() helper function Will Deacon
2021-05-25 15:14 ` [PATCH v7 16/22] sched: Defer wakeup in ttwu() for unschedulable frozen tasks Will Deacon
2021-05-27 14:10   ` Peter Zijlstra
2021-05-27 14:31     ` Peter Zijlstra
2021-05-27 14:44       ` Will Deacon
2021-05-27 14:55         ` Peter Zijlstra
2021-05-27 14:50       ` Peter Zijlstra
2021-05-28 10:49       ` Peter Zijlstra
2021-05-27 14:36     ` Will Deacon
2021-06-01  8:21   ` [RFC][PATCH] freezer,sched: Rewrite core freezer logic Peter Zijlstra
2021-06-01 11:27     ` Peter Zijlstra
2021-06-02 12:54       ` Will Deacon
2021-06-03 10:35         ` Peter Zijlstra
2021-06-03 10:58           ` Will Deacon
2021-06-03 11:26             ` Peter Zijlstra
2021-06-03 11:36               ` Will Deacon
2021-05-25 15:14 ` [PATCH v7 17/22] arm64: Implement task_cpu_possible_mask() Will Deacon
2021-05-25 15:14 ` [PATCH v7 18/22] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0 Will Deacon
2021-05-25 15:14 ` [PATCH v7 19/22] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system Will Deacon
2021-05-25 15:14 ` [PATCH v7 20/22] arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0 Will Deacon
2021-05-25 15:14 ` [PATCH v7 21/22] arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores Will Deacon
2021-05-25 15:14 ` [PATCH v7 22/22] Documentation: arm64: describe asymmetric 32-bit support Will Deacon
2021-05-25 17:13   ` Marc Zyngier
2021-05-25 17:27     ` Will Deacon
2021-05-25 18:11       ` Marc Zyngier
2021-05-26 16:00         ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).