* [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-03-17  0:36 ` Alexey Klimov
From: Alexey Klimov @ 2021-03-17  0:36 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: peterz, yury.norov, daniel.m.jordan, tglx, jobaker,
	audralmitchel, arnd, gregkh, rafael, tj, qais.yousef, hannes,
	klimov.linux

When a CPU is offlined and onlined via device_offline() and device_online(),
userspace gets a uevent notification. If, after receiving the "online" uevent,
userspace executes sched_setaffinity() on some task trying to move it
to the recently onlined CPU, the call sometimes fails with -EINVAL. Userspace
needs to wait around 5..30 ms after receiving the uevent before
sched_setaffinity() will succeed for the recently onlined CPU.

If the in_mask argument for sched_setaffinity() contains only the recently
onlined CPU, the call can fail with the following flow:

  sched_setaffinity()
    cpuset_cpus_allowed()
      guarantee_online_cpus()   <-- cs->effective_cpus mask does not
                                        contain recently onlined cpu
    cpumask_and()               <-- final new_mask is empty
    __set_cpus_allowed_ptr()
      cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
      returns -EINVAL

The cpusets used in guarantee_online_cpus() are updated from a workqueue
scheduled by cpuset_update_active_cpus(), which in turn is called from the cpu
hotplug callback sched_cpu_activate(). Hence the update may not yet be
observable by sched_setaffinity() if it is called immediately after the uevent.

The premature uevent can be avoided by ensuring that cpuset_hotplug_work
has run to completion, using cpuset_wait_for_hotplug(), after onlining the
CPU in cpu_device_up() and in cpuhp_smt_enable().

Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Co-analyzed-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Alexey Klimov <aklimov@redhat.com>
---

Changes since v2:
	- restore cpuhp_{online,offline}_cpu_device back and move it out
		of cpu maps lock;
	- use Reviewed-by from Qais;
	- minor corrections in commit message and in comment in code.

Changes since v1:
	- cpuset_wait_for_hotplug() moved to cpu_device_up();
	- corrections in comments;
	- removed cpuhp_{online,offline}_cpu_device.

Changes since RFC:
	- cpuset_wait_for_hotplug() used in cpuhp_smt_enable().

Previous patches and discussion are:
RFC patch: https://lore.kernel.org/lkml/20201203171431.256675-1-aklimov@redhat.com/
v1 patch:  https://lore.kernel.org/lkml/20210204010157.1823669-1-aklimov@redhat.com/
v2 patch: https://lore.kernel.org/lkml/20210212003032.2037750-1-aklimov@redhat.com/

The commit a49e4629b5ed "cpuset: Make cpuset hotplug synchronous"
would also get rid of the early uevent but it was reverted (deadlocks).

The nature of this bug is also described here (with different consequences):
https://lore.kernel.org/lkml/20200211141554.24181-1-qais.yousef@arm.com/

Reproducer: https://gitlab.com/0xeafffffe/xlam

With these changes the reproducer code runs without issues. The idea is to
avoid the situation where userspace receives the uevent about an onlined CPU
that is not yet ready to take tasks for a while after the uevent.
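
For illustration, the essence of the reproducer boils down to something like
the minimal sketch below (the linked xlam reproducer is the authoritative
version; the CPU number and sysfs path are assumptions, it needs root, and the
real reproducer reacts to the uevent rather than to the sysfs write returning):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a short string ("0"/"1") to a sysfs attribute such as cpuN/online. */
static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	const char *online = "/sys/devices/system/cpu/cpu1/online";
	cpu_set_t set;

	write_str(online, "0");	/* offline CPU 1 */
	write_str(online, "1");	/* online it again; the "online" uevent fires here */

	CPU_ZERO(&set);
	CPU_SET(1, &set);	/* affinity mask contains only the fresh CPU */

	/* Without the fix this intermittently fails with EINVAL. */
	if (sched_setaffinity(0, sizeof(set), &set))
		printf("sched_setaffinity: %s\n", strerror(errno));

	return 0;
}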

 kernel/cpu.c | 74 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 18 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1b6302ecbabe..9b091d8a8811 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -15,6 +15,7 @@
 #include <linux/sched/smt.h>
 #include <linux/unistd.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/oom.h>
 #include <linux/rcupdate.h>
 #include <linux/export.h>
@@ -1301,7 +1302,17 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
  */
 int cpu_device_up(struct device *dev)
 {
-	return cpu_up(dev->id, CPUHP_ONLINE);
+	int err;
+
+	err = cpu_up(dev->id, CPUHP_ONLINE);
+	/*
+	 * Wait for cpuset updates to cpumasks to finish.  Later on this path
+	 * may generate uevents whose consumers rely on the updates.
+	 */
+	if (!err)
+		cpuset_wait_for_hotplug();
+
+	return err;
 }
 
 int add_cpu(unsigned int cpu)
@@ -2084,8 +2095,13 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
 
 int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 {
-	int cpu, ret = 0;
+	cpumask_var_t mask;
+	int cpu, ret;
 
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = 0;
 	cpu_maps_update_begin();
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
@@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 		ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
 		if (ret)
 			break;
-		/*
-		 * As this needs to hold the cpu maps lock it's impossible
-		 * to call device_offline() because that ends up calling
-		 * cpu_down() which takes cpu maps lock. cpu maps lock
-		 * needs to be held as this might race against in kernel
-		 * abusers of the hotplug machinery (thermal management).
-		 *
-		 * So nothing would update device:offline state. That would
-		 * leave the sysfs entry stale and prevent onlining after
-		 * smt control has been changed to 'off' again. This is
-		 * called under the sysfs hotplug lock, so it is properly
-		 * serialized against the regular offline usage.
-		 */
-		cpuhp_offline_cpu_device(cpu);
+
+		cpumask_set_cpu(cpu, mask);
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
 	cpu_maps_update_done();
+
+	/*
+	 * When the cpu maps lock was taken above it was impossible
+	 * to call device_offline() because that ends up calling
+	 * cpu_down() which takes cpu maps lock. cpu maps lock
+	 * needed to be held as this might race against in-kernel
+	 * abusers of the hotplug machinery (thermal management).
+	 *
+	 * So nothing would update device:offline state. That would
+	 * leave the sysfs entry stale and prevent onlining after
+	 * smt control has been changed to 'off' again. This is
+	 * called under the sysfs hotplug lock, so it is properly
+	 * serialized against the regular offline usage.
+	 */
+	for_each_cpu(cpu, mask)
+		cpuhp_offline_cpu_device(cpu);
+
+	free_cpumask_var(mask);
 	return ret;
 }
 
 int cpuhp_smt_enable(void)
 {
-	int cpu, ret = 0;
+	cpumask_var_t mask;
+	int cpu, ret;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
 
+	ret = 0;
 	cpu_maps_update_begin();
 	cpu_smt_control = CPU_SMT_ENABLED;
 	for_each_present_cpu(cpu) {
@@ -2128,9 +2155,20 @@ int cpuhp_smt_enable(void)
 		if (ret)
 			break;
 		/* See comment in cpuhp_smt_disable() */
-		cpuhp_online_cpu_device(cpu);
+		cpumask_set_cpu(cpu, mask);
 	}
 	cpu_maps_update_done();
+
+	/*
+	 * Wait for cpuset updates to cpumasks to finish.  Later on this path
+	 * may generate uevents whose consumers rely on the updates.
+	 */
+	cpuset_wait_for_hotplug();
+
+	for_each_cpu(cpu, mask)
+		cpuhp_online_cpu_device(cpu);
+
+	free_cpumask_var(mask);
 	return ret;
 }
 #endif
-- 
2.31.0


* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-03-18 19:28   ` Daniel Jordan
From: Daniel Jordan @ 2021-03-18 19:28 UTC (permalink / raw)
  To: Alexey Klimov, linux-kernel, cgroups
  Cc: peterz, yury.norov, tglx, jobaker, audralmitchel, arnd, gregkh,
	rafael, tj, qais.yousef, hannes, klimov.linux

Alexey Klimov <aklimov@redhat.com> writes:
> When a CPU offlined and onlined via device_offline() and device_online()
> the userspace gets uevent notification. If, after receiving "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to a recently onlined CPU, then it sometimes fails with -EINVAL. Userspace
> needs to wait around 5..30 ms before sched_setaffinity() will succeed for
> recently onlined CPU after receiving uevent.
>
> If in_mask argument for sched_setaffinity() has only recently onlined CPU,
> it could fail with such flow:
>
>   sched_setaffinity()
>     cpuset_cpus_allowed()
>       guarantee_online_cpus()   <-- cs->effective_cpus mask does not
>                                         contain recently onlined cpu
>     cpumask_and()               <-- final new_mask is empty
>     __set_cpus_allowed_ptr()
>       cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>       returns -EINVAL
>
> Cpusets used in guarantee_online_cpus() are updated using workqueue from
> cpuset_update_active_cpus() which in its turn is called from cpu hotplug callback
> sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> it is called immediately after uevent.
>
> Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> has run to completion using cpuset_wait_for_hotplug() after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().
>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Co-analyzed-by: Joshua Baker <jobaker@redhat.com>
> Signed-off-by: Alexey Klimov <aklimov@redhat.com>

Looks good to me.

Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>


* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-03-27 21:01   ` Thomas Gleixner
From: Thomas Gleixner @ 2021-03-27 21:01 UTC (permalink / raw)
  To: Alexey Klimov, linux-kernel, cgroups
  Cc: peterz, yury.norov, daniel.m.jordan, jobaker, audralmitchel,
	arnd, gregkh, rafael, tj, qais.yousef, hannes, klimov.linux

Alexey,

On Wed, Mar 17 2021 at 00:36, Alexey Klimov wrote:
> When a CPU offlined and onlined via device_offline() and device_online()
> the userspace gets uevent notification. If, after receiving "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to a recently onlined CPU, then it sometimes fails with -EINVAL. Userspace
> needs to wait around 5..30 ms before sched_setaffinity() will succeed for
> recently onlined CPU after receiving uevent.
>
> Cpusets used in guarantee_online_cpus() are updated using workqueue from
> cpuset_update_active_cpus() which in its turn is called from cpu hotplug callback
> sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> it is called immediately after uevent.

And because cpusets are using a workqueue just to deal with their
backwards lock order we need to cure the symptom in the CPU hotplug
code, right?

> Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> has run to completion using cpuset_wait_for_hotplug() after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().

It can also be avoided by fixing the root cause which is _NOT_ in the
CPU hotplug code at all.

The fundamental assumption of CPU hotplug is that if the state machine
reaches a given state, which might have user space visible effects or
even just kernel visible effects, the overall state of the system has to
be consistent.

cpusets violate this assumption. And they do so since 2013 due to commit
3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()").

If that cannot be fixed in cgroups/cpusets without rewriting the whole
cpusets/cgroups muck, then this wants to be explained and justified in the
changelog.

Looking at the changelog of 3a5a6d0c2b03 it's entirely clear that this
is non trivial because that changelog clearly states that the lock order
is a design decision and that design decision required that workqueue
workaround....

See? Now we suddenly have a proper root cause and not just a description
of the symptom with some hidden hint about workqueues. And we have an
argument why fixing the root cause is close to impossible.

>  int cpu_device_up(struct device *dev)
>  {
> -	return cpu_up(dev->id, CPUHP_ONLINE);
> +	int err;
> +
> +	err = cpu_up(dev->id, CPUHP_ONLINE);
> +	/*
> +	 * Wait for cpuset updates to cpumasks to finish.  Later on this path
> +	 * may generate uevents whose consumers rely on the updates.
> +	 */
> +	if (!err)
> +		cpuset_wait_for_hotplug();

No. Quite some people wasted^Wspent a considerable amount of time to get
the hotplug trainwreck cleaned up and we are not sprinkling random
workarounds all over the place again.

>  int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  {
> -	int cpu, ret = 0;
> +	cpumask_var_t mask;
> +	int cpu, ret;
>  
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	ret = 0;
>  	cpu_maps_update_begin();
>  	for_each_online_cpu(cpu) {
>  		if (topology_is_primary_thread(cpu))
> @@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
>  		ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
>  		if (ret)
>  			break;
> -		/*
> -		 * As this needs to hold the cpu maps lock it's impossible
> -		 * to call device_offline() because that ends up calling
> -		 * cpu_down() which takes cpu maps lock. cpu maps lock
> -		 * needs to be held as this might race against in kernel
> -		 * abusers of the hotplug machinery (thermal management).
> -		 *
> -		 * So nothing would update device:offline state. That would
> -		 * leave the sysfs entry stale and prevent onlining after
> -		 * smt control has been changed to 'off' again. This is
> -		 * called under the sysfs hotplug lock, so it is properly
> -		 * serialized against the regular offline usage.
> -		 */
> -		cpuhp_offline_cpu_device(cpu);
> +
> +		cpumask_set_cpu(cpu, mask);
>  	}
>  	if (!ret)
>  		cpu_smt_control = ctrlval;
>  	cpu_maps_update_done();
> +
> +	/*
> +	 * When the cpu maps lock was taken above it was impossible
> +	 * to call device_offline() because that ends up calling
> +	 * cpu_down() which takes cpu maps lock. cpu maps lock
> +	 * needed to be held as this might race against in-kernel
> +	 * abusers of the hotplug machinery (thermal management).
> +	 *
> +	 * So nothing would update device:offline state. That would
> +	 * leave the sysfs entry stale and prevent onlining after
> +	 * smt control has been changed to 'off' again. This is
> +	 * called under the sysfs hotplug lock, so it is properly
> +	 * serialized against the regular offline usage.
> +	 */
> +	for_each_cpu(cpu, mask)
> +		cpuhp_offline_cpu_device(cpu);
> +
> +	free_cpumask_var(mask);
>  	return ret;

Brilliant stuff that. All this does is to move the uevent out of the cpu
maps locked region. What for? There is no wait here? Why?

The work is scheduled whenever the active state of a CPU is changed. And
just because on unplug it's not triggering any test program failure
(yet) does not make it more correct. The uevent is again delivered
before consistent state is reached.

And to be clear, it's not about uevent at all. That's the observable
side of it.

The point is that the state is inconsistent when the hotplug functions
return. That's the only really interesting statement here.

And while you carefully reworded the comment, did you actually read what
it said and what it says now?

> -		 * cpu_down() which takes cpu maps lock. cpu maps lock
> -		 * needs to be held as this might race against in kernel
> -		 * abusers of the hotplug machinery (thermal management).

vs.

> +	 * cpu_down() which takes cpu maps lock. cpu maps lock
> +	 * needed to be held as this might race against in-kernel
> +	 * abusers of the hotplug machinery (thermal management).

The big fat hint is: "cpu maps lock needs to be held as this ...." and
it still needs to be held for the above loop to work correctly at
all. See also below.

So just moving comments blindly around and making them past tense is not
cutting it. Quite the contrary, the comments make no sense anymore. They
became incomprehensible word salad.

Now for the second part of that comment:

> +      *                                          ....  This is
> +	 * called under the sysfs hotplug lock, so it is properly
> +	 * serialized against the regular offline usage.

So there are two layers of protection:

   cpu_maps_lock and sysfs hotplug lock

One protects surprisingly against concurrent sysfs writes and the other
is required to serialize in kernel usage.

Now let's look at the original protection:

   lock(sysfs)
     lock(cpu_maps)
       hotplug
        dev->offline = new_state
        uevent()
     unlock(cpu_maps)
   unlock(sysfs)

and the one you created:

   lock(sysfs)
     lock(cpu_maps)
       hotplug
     unlock(cpu_maps)
     dev->offline = new_state
     uevent()
   unlock(sysfs)

Where is that protection scope change mentioned in the change log and
where is the analysis that it is correct?

Now you also completely fail to explain _why_ the uevent and the
wait_for() muck have to be done _outside_ of the cpu maps locked region.

workqueues and cpusets/cgroups have exactly zero lock relationship to
the map lock.

The actual problematic lock nesting is

    cgroup_mutex -> cpu_hotplug_lock

and for the hotplug synchronous case this turns into

    cpu_hotplug_lock -> cgroup_mutex

which is an obvious deadlock.

So the only requirement is to avoid _that_ particular lock nesting, but
having:

    cpu_add_remove_lock -> cpu_hotplug_lock

and

    cpu_add_remove_lock -> cgroup_mutex -> cpu_hotplug_lock

which is built via waiting on the work to finish is perfectly fine.
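
For reference, "waiting on the work to finish" boils down to flushing the
deferred cpuset work item. A simplified sketch of the cpuset side (roughly as
in kernel/cgroup/cpuset.c at the time, not verbatim):

/* The hotplug handling that cpusets defer to a workqueue. */
static void cpuset_hotplug_workfn(struct work_struct *work); /* rebuilds effective_cpus etc. */
static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

void cpuset_update_active_cpus(void)
{
	/*
	 * Called from the hotplug callbacks with cpu_hotplug_lock held:
	 * bounce the actual processing to a work item to avoid the reverse
	 * cgroup_mutex -> cpu_hotplug_lock ordering.
	 */
	schedule_work(&cpuset_hotplug_work);
}

void cpuset_wait_for_hotplug(void)
{
	/* Blocks until cpuset_hotplug_workfn() has run to completion. */
	flush_work(&cpuset_hotplug_work);
}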

Waiting under cpu_add_remove_lock is also perfectly fine and there are
other things in the hotplug callbacks which wait on stuff for way
longer.

Which in turn points to the obvious spots to add the required _two_ lines
of code which solve this without creating an unholy mess and sprinkling
that call all over the place.

 _cpu_up()
 {
        ...
        cpus_write_unlock(); <- Drops cpu_hotplug_lock
        ...
+       if (!tasks_frozen)
+		cpusets_wait_for_hotplug();
        return ret;

And the same for _cpu_down(), which obviously begs for a helper
function, which in turn needs tons of comments to explain why that place
is not a dump ground for random hacks of the day and what, if anything,
might be justified to be called there, along with the implications of that.

But that's a documentation thing, which would also be necessary when the
requirement would be to move it outside of cpu_add_remove_lock.

And none of this has anything to do with uevents.

Thanks,

        tglx
---
Subject: cpu/hotplug: Cure the cpusets trainwreck
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sat, 27 Mar 2021 15:57:29 +0100

Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexey and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Qais Yousef <qais.yousef@arm.com>
---
 kernel/cpu.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -32,6 +32,7 @@
 #include <linux/relay.h>
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/cpuset.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -867,6 +868,52 @@ void clear_tasks_mm_cpumask(int cpu)
 	rcu_read_unlock();
 }
 
+/*
+ *
+ * Serialize hotplug trainwrecks outside of the cpu_hotplug_lock
+ * protected region.
+ *
+ * The operation is still serialized against concurrent CPU hotplug via
+ * cpu_add_remove_lock, i.e. CPU map protection.  But it is _not_
+ * serialized against other hotplug related activity like adding or
+ * removing of state callbacks and state instances, which invoke either the
+ * startup or the teardown callback of the affected state.
+ *
+ * This is required for subsystems which are unfixable vs. CPU hotplug and
+ * evade lock inversion problems by scheduling work which has to be
+ * completed _before_ cpu_up()/_cpu_down() returns.
+ *
+ * Don't even think about adding anything to this for any new code or even
+ * drivers. It's only purpose is to keep existing lock order trainwrecks
+ * working.
+ *
+ * For cpu_down() there might be valid reasons to finish cleanups which are
+ * not required to be done under cpu_hotplug_lock, but that's a different
+ * story and would be not invoked via this.
+ */
+static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
+{
+	/*
+	 * cpusets delegate hotplug operations to a worker to "solve" the
+	 * lock order problems. Wait for the worker, but only if tasks are
+	 * _not_ frozen (suspend, hibernate) as that would wait forever.
+	 *
+	 * The wait is required because otherwise the hotplug operation
+	 * returns with inconsistent state, which could even be observed in
+	 * user space when a new CPU is brought up. The CPU plug uevent
+	 * would be delivered and user space reacting on it would fail to
+	 * move tasks to the newly plugged CPU up to the point where the
+	 * work has finished because up to that point the newly plugged CPU
+	 * is not assignable in cpusets/cgroups. On unplug that's not
+	 * necessarily a visible issue, but it is still inconsistent state,
+	 * which is the real problem which needs to be "fixed". This can't
+	 * prevent the transient state between scheduling the work and
+	 * returning from waiting for it.
+	 */
+	if (!tasks_frozen)
+		cpuset_wait_for_hotplug();
+}
+
 /* Take this CPU down. */
 static int take_cpu_down(void *_param)
 {
@@ -1058,6 +1105,7 @@ static int __ref _cpu_down(unsigned int
 	 */
 	lockup_detector_cleanup();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 
@@ -1254,6 +1302,7 @@ static int _cpu_up(unsigned int cpu, int
 out:
 	cpus_write_unlock();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 


* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-04-01 14:06     ` Qais Yousef
From: Qais Yousef @ 2021-04-01 14:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexey Klimov, linux-kernel, cgroups, peterz, yury.norov,
	daniel.m.jordan, jobaker, audralmitchel, arnd, gregkh, rafael,
	tj, hannes, klimov.linux

On 03/27/21 22:01, Thomas Gleixner wrote:
> And while you carefully reworded the comment, did you actually read what
> it said and what is says now?
> 
> > -		 * cpu_down() which takes cpu maps lock. cpu maps lock
> > -		 * needs to be held as this might race against in kernel
> > -		 * abusers of the hotplug machinery (thermal management).
> 
> vs.
> 
> > +	 * cpu_down() which takes cpu maps lock. cpu maps lock
> > +	 * needed to be held as this might race against in-kernel
> > +	 * abusers of the hotplug machinery (thermal management).
> 
> The big fat hint is: "cpu maps lock needs to be held as this ...." and
> it still needs to be held for the above loop to work correctly at
> all. See also below.
> 
> So just moving comments blindly around and making them past tense is not
> cutting it. Quite the contrary the comments make no sense anymore. They
> became uncomprehensible word salad.
> 
> Now for the second part of that comment:
> 
> > +      *                                          ....  This is
> > +	 * called under the sysfs hotplug lock, so it is properly
> > +	 * serialized against the regular offline usage.
> 
> So there are two layers of protection:
> 
>    cpu_maps_lock and sysfs hotplug lock
> 
> One protects surprisingly against concurrent sysfs writes and the other
> is required to serialize in kernel usage.
> 
> Now lets look at the original protection:
> 
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>         dev->offline = new_state
>         uevent()
>      unlock(cpu_maps)
>    unlock(sysfs)
> 
> and the one you created:
> 
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>      unlock(cpu_maps)
>      dev->offline = new_state
>      uevent()
>    unlock(sysfs)
> 
> Where is that protection scope change mentioned in the change log and
> where is the analysis that it is correct?

The comment does still need updating though. Its reference to thermal management
is cryptic; I couldn't see who performs hotplug operations in thermal code.

I also think generally that comment is no longer valid after the refactor I did
to 'prevent' in-kernel users from calling cpu_up/down() directly and force them
all to go via device_offline/online() which is wrapped via the add/remove_cpu()

	33c3736ec888 cpu/hotplug: Hide cpu_up/down()

So I think except for suspend/resume/hibernate/kexec, all in-kernel users
should be serialized by lock(hotplug) now. Since uevents don't matter for
suspend/resume/hibernate/kexec, I think moving it outside of lock(cpu_maps) is
fine.

So IIUC in the past we had the race

	userspace			in-kernel users

	lock(hotplug)
	cpuhp_smt_disable()
	   lock(cpu_maps)		cpu_down()
					  lock(cpu_maps)

So they were serialized by cpu_maps lock. But that should not happen now since
in-kernel (except for the aforementioned) users should use remove_cpu().

	userspace			in-kernel users

	lock(hotplug)
	cpuhp_smt_disable()
	   lock(cpu_maps)		remove_cpu()
					  lock(hotplug)
					  device_offline()
					    cpu_down()
					      lock(cpu_maps)

Which forces the serialization at lock(hotplug), which is what
lock_device_hotplug_sysfs() actually tries to hold.

So I think that race condition should not happen now. Or did I miss something?

The only loophole is that cpu_device_up() could be abused if not caught by
reviewers. I didn't find a way to warn/error if someone other than
cpu_subsys_online() uses it. We rely on a comment explaining it.

I think cpuhp_smt_disable/enable can safely call device_offline/online now.
Although it might still be more efficient to hold lock(cpu_maps) once than
repeatedly in a loop.
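
A hypothetical sketch of that direction, purely for illustration (untested and
not a proposed patch; it assumes the caller's sysfs device hotplug lock is
sufficient serialization, which is exactly the point that would need the
analysis above):

int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
{
	int cpu, ret = 0;

	for_each_online_cpu(cpu) {
		if (topology_is_primary_thread(cpu))
			continue;
		/*
		 * device_offline() ends up in cpu_down(), which takes the
		 * cpu maps lock itself, so that lock must not be held here.
		 * It also updates device:offline and emits the uevent.
		 */
		ret = device_offline(get_cpu_device(cpu));
		if (ret)
			break;
	}

	cpu_maps_update_begin();
	if (!ret)
		cpu_smt_control = ctrlval;
	cpu_maps_update_done();

	return ret;
}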

If we do that, then cpu_up_down_serialize_trainwrecks() can be called from
cpu_device_up/down(), which implies !tasks_frozen.

Can't remember now if Alexey moved the uevent() handling out of the loop for
efficiency reasons or was seeing something else. I doubt it was the latter.


Thanks

--
Qais Yousef


* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-04-04  2:32     ` Alexey Klimov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexey Klimov @ 2021-04-04  2:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexey Klimov, Linux Kernel Mailing List, cgroups, peterz,
	Yury Norov, daniel.m.jordan, jobaker, audralmitchel,
	Arnd Bergmann, Greg Kroah-Hartman, rafael, tj, qais.yousef,
	hannes

On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:

Lovely that you eventually found time to take a look at this since the
first RFC patch was sent.

> Alexey,
>
> On Wed, Mar 17 2021 at 00:36, Alexey Klimov wrote:
> > When a CPU offlined and onlined via device_offline() and device_online()
> > the userspace gets uevent notification. If, after receiving "online" uevent,
> > userspace executes sched_setaffinity() on some task trying to move it
> > to a recently onlined CPU, then it sometimes fails with -EINVAL. Userspace
> > needs to wait around 5..30 ms before sched_setaffinity() will succeed for
> > recently onlined CPU after receiving uevent.
> >
> > Cpusets used in guarantee_online_cpus() are updated using workqueue from
> > cpuset_update_active_cpus() which in its turn is called from cpu hotplug callback
> > sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> > it is called immediately after uevent.
>
> And because cpusets are using a workqueue just to deal with their
> backwards lock order we need to cure the symptom in the CPU hotplug
> code, right?

Feel free to suggest a better place.

> > Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> > has run to completion using cpuset_wait_for_hotplug() after onlining the
> > cpu in cpu_device_up() and in cpuhp_smt_enable().
>
> It can also be avoided by fixing the root cause which is _NOT_ in the
> CPU hotplug code at all.
>
> The fundamental assumption of CPU hotplug is that if the state machine
> reaches a given state, which might have user space visible effects or
> even just kernel visible effects, the overall state of the system has to
> be consistent.
>
> cpusets violate this assumption. And they do so since 2013 due to commit
> 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()").
>
> If that cannot be fixed in cgroups/cpusets without rewriting the whole
> cpusets/cgroups muck, then this wants to be explained and justified in the
> changelog.
>
> Looking at the changelog of 3a5a6d0c2b03 it's entirely clear that this
> is non trivial because that changelog clearly states that the lock order
> is a design decision and that design decision required that workqueue
> workaround....
>
> See? Now we suddenly have a proper root cause and not just a description
> of the symptom with some hidden hint about workqueues. And we have an
> argument why fixing the root cause is close to impossible.

Thank you for this educational scolding here and below. I see that the
problem here is more fundamental than I thought, and my commit message
standards are too low for you.
Good to see that a bug that may have existed since 2013 could finally be fixed.

> >  int cpu_device_up(struct device *dev)
> >  {
> > -     return cpu_up(dev->id, CPUHP_ONLINE);
> > +     int err;
> > +
> > +     err = cpu_up(dev->id, CPUHP_ONLINE);
> > +     /*
> > +      * Wait for cpuset updates to cpumasks to finish.  Later on this path
> > +      * may generate uevents whose consumers rely on the updates.
> > +      */
> > +     if (!err)
> > +             cpuset_wait_for_hotplug();
>
> No. Quite some people wasted^Wspent a considerable amount of time to get
> the hotplug trainwreck cleaned up and we are not sprinkling random
> workarounds all over the place again.
>
> >  int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> >  {
> > -     int cpu, ret = 0;
> > +     cpumask_var_t mask;
> > +     int cpu, ret;
> >
> > +     if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> > +             return -ENOMEM;
> > +
> > +     ret = 0;
> >       cpu_maps_update_begin();
> >       for_each_online_cpu(cpu) {
> >               if (topology_is_primary_thread(cpu))
> > @@ -2093,31 +2109,42 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> >               ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE);
> >               if (ret)
> >                       break;
> > -             /*
> > -              * As this needs to hold the cpu maps lock it's impossible
> > -              * to call device_offline() because that ends up calling
> > -              * cpu_down() which takes cpu maps lock. cpu maps lock
> > -              * needs to be held as this might race against in kernel
> > -              * abusers of the hotplug machinery (thermal management).
> > -              *
> > -              * So nothing would update device:offline state. That would
> > -              * leave the sysfs entry stale and prevent onlining after
> > -              * smt control has been changed to 'off' again. This is
> > -              * called under the sysfs hotplug lock, so it is properly
> > -              * serialized against the regular offline usage.
> > -              */
> > -             cpuhp_offline_cpu_device(cpu);
> > +
> > +             cpumask_set_cpu(cpu, mask);
> >       }
> >       if (!ret)
> >               cpu_smt_control = ctrlval;
> >       cpu_maps_update_done();
> > +
> > +     /*
> > +      * When the cpu maps lock was taken above it was impossible
> > +      * to call device_offline() because that ends up calling
> > +      * cpu_down() which takes cpu maps lock. cpu maps lock
> > +      * needed to be held as this might race against in-kernel
> > +      * abusers of the hotplug machinery (thermal management).
> > +      *
> > +      * So nothing would update device:offline state. That would
> > +      * leave the sysfs entry stale and prevent onlining after
> > +      * smt control has been changed to 'off' again. This is
> > +      * called under the sysfs hotplug lock, so it is properly
> > +      * serialized against the regular offline usage.
> > +      */
> > +     for_each_cpu(cpu, mask)
> > +             cpuhp_offline_cpu_device(cpu);
> > +
> > +     free_cpumask_var(mask);
> >       return ret;
>
> Brilliant stuff that. All this does is to move the uevent out of the cpu
> maps locked region. What for? There is no wait here? Why?

To match cpuhp_smt_enable().

> The work is scheduled whenever the active state of a CPU is changed. And
> just because on unplug it's not triggering any test program failure
> (yet) does not make it more correct. The uevent is again delivered
> before consistent state is reached.
>
> And to be clear, it's not about uevent at all. That's the observable
> side of it.
>
> The point is that the state is inconsistent when the hotplug functions
> return. That's the only really interesting statement here.
>
> And while you carefully reworded the comment, did you actually read what
> it said and what it says now?
>
> > -              * cpu_down() which takes cpu maps lock. cpu maps lock
> > -              * needs to be held as this might race against in kernel
> > -              * abusers of the hotplug machinery (thermal management).
>
> vs.
>
> > +      * cpu_down() which takes cpu maps lock. cpu maps lock
> > +      * needed to be held as this might race against in-kernel
> > +      * abusers of the hotplug machinery (thermal management).
>
> The big fat hint is: "cpu maps lock needs to be held as this ...." and
> it still needs to be held for the above loop to work correctly at
> all. See also below.
>
> So just moving comments blindly around and making them past tense is not
> cutting it. Quite the contrary, the comments make no sense anymore. They
> became incomprehensible word salad.

My bad.

While you're on this, I need to ask one question about:
* ...cpu maps lock
* needs to be held as this might race against in kernel
* abusers of the hotplug machinery (thermal management).

Who are the abusers of hotplug from the thermal management side?
I spent (wasted) a considerable amount of time searching for them in
drivers/thermal/* but nothing seemed suspicious.

> Now for the second part of that comment:
>
> > +      *                                          ....  This is
> > +      * called under the sysfs hotplug lock, so it is properly
> > +      * serialized against the regular offline usage.
>
> So there are two layers of protection:
>
>    cpu_maps_lock and sysfs hotplug lock
>
> One protects surprisingly against concurrent sysfs writes and the other
> is required to serialize in kernel usage.
>
> Now lets look at the original protection:
>
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>         dev->offline = new_state
>         uevent()
>      unlock(cpu_maps)
>    unlock(sysfs)
>
> and the one you created:
>
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>      unlock(cpu_maps)
>      dev->offline = new_state
>      uevent()
>    unlock(sysfs)
>
> Where is that protection scope change mentioned in the change log and
> where is the analysis that it is correct?

It was not mentioned in the commit message. It was mentioned in the place
that describes the changes between different versions of the patch.
There is already a place in device_online() and device_offline() where
uevent() is sent under lock(sysfs) only (in your notation) and
outside the cpu_maps lock (the same goes for dev->offline status). IIRC
that's why I didn't explain it specifically.
Will do better in future.

> Now you also completely fail to explain _why_ the uevent and the
> wait_for() muck have to be done _outside_ of the cpu maps locked region.

That's correct: there is no explanation.
Before this became the more fundamental question of hotplug state
consistency, the intent was to perform one cpuset_wait_for_hotplug() after
all hotplug calls rather than one cpuset_wait_for_hotplug() per
individual cpu_up()/cpu_down().
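
To make the intended difference concrete, here is a purely illustrative
sketch (the function names are made up for the example; this is not code
from any version of the patch):

	/* One synchronous wait per onlined CPU: simple but repeated. */
	static int online_cpus_wait_each(const struct cpumask *mask)
	{
		int cpu, ret = 0;

		for_each_cpu(cpu, mask) {
			ret = device_online(get_cpu_device(cpu));
			if (ret)
				break;
			cpuset_wait_for_hotplug();
		}
		return ret;
	}

	/* One wait after the whole batch, which is what v3 aimed for. */
	static int online_cpus_wait_once(const struct cpumask *mask)
	{
		int cpu, ret = 0;

		for_each_cpu(cpu, mask) {
			ret = device_online(get_cpu_device(cpu));
			if (ret)
				break;
		}
		cpuset_wait_for_hotplug();
		return ret;
	}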

> workqueues and cpusets/cgroups have exactly zero lock relationship to
> the map lock.
>
> The actual problematic lock nesting is
>
>     cgroup_mutex -> cpu_hotplug_lock
>
> and for the hotplug synchronous case this turns into
>
>     cpu_hotplug_lock -> cgroup_mutex
>
> which is an obvious deadlock.
>
> So the only requirement is to avoid _that_ particular lock nesting, but
> having:
>
>     cpu_add_remove_lock -> cpu_hotplug_lock
>
> and
>
>     cpu_add_remove_lock -> cgroup_mutex -> cpu_hotplug_lock
>
> which is built via waiting on the work to finish is perfectly fine.
>
> Waiting under cpu_add_remove_lock is also perfectly fine and there are
> other things in the hotplug callbacks which wait on stuff for way
> longer.
>
> Which in turn points to the obvious spots to add the required _two_ lines
> of code which solve this without creating an unholy mess and sprinkling
> that call all over the place.
>
>  _cpu_up()
>  {
>         ...
>         cpus_write_unlock(); <- Drops cpu_hotplug_lock
>         ...
> +       if (!tasks_frozen)
> +               cpusets_wait_for_hotplug();
>         return ret;
>
> And the same for _cpu_down(), which obviously begs for a helper
> function, which in turn needs tons of comments to explain why that place
> is not a dump ground for random hacks of the day and what, if anything,
> might be justified to be called there, along with the implications of that.
>
> But that's a documentation thing, which would also be necessary when the
> requirement would be to move it outside of cpu_add_remove_lock.
>
> And none of this has anything to do with uevents.
>
> Thanks,
>
>         tglx
> ---
> Subject: cpu/hotplug: Cure the cpusets trainwreck
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Sat, 27 Mar 2021 15:57:29 +0100
>
> Alexey and Joshua tried to solve a cpusets related hotplug problem which is
> user space visible and results in unexpected behaviour for some time after
> a CPU has been plugged in and the corresponding uevent was delivered.

I'll give it a go and get back to you. It seems this patch should work.

[...]

Best regards,
Alexey Klimov.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-04-15  1:30       ` Alexey Klimov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexey Klimov @ 2021-04-15  1:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux Kernel Mailing List, cgroups, Peter Zijlstra, Yury Norov,
	Daniel Jordan, Joshua Baker, audralmitchel, Arnd Bergmann,
	Greg Kroah-Hartman, rafael, tj, Qais Yousef, hannes,
	Alexey Klimov

On Sun, Apr 4, 2021 at 3:32 AM Alexey Klimov <klimov.linux@gmail.com> wrote:
>
> On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:

[...]

Now, the patch:

>> Subject: cpu/hotplug: Cure the cpusets trainwreck
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Sat, 27 Mar 2021 15:57:29 +0100
>>
>> Alexey and Joshua tried to solve a cpusets related hotplug problem which is
>> user space visible and results in unexpected behaviour for some time after
>> a CPU has been plugged in and the corresponding uevent was delivered.
>>
>> cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
>> workqueue. This is done because the cpusets code has already a lock
>> nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
>> waiting for the work to finish with cpu_hotplug_lock held can and will
>> deadlock because that results in the reverse lock order.
>>
>> As a consequence the uevent can be delivered before cpusets have consistent
>> state which means that a user space invocation of sched_setaffinity() to
>> move a task to the plugged CPU fails up to the point where the scheduled
>> work has been processed.
>>
>> The same is true for CPU unplug, but that does not create user observable
>> failure (yet).
>>
>> It's still inconsistent to claim that an operation is finished before it
>> actually is and that's the real issue at hand. uevents just make it
>> reliably observable.
>>
>> Obviously the problem should be fixed in cpusets/cgroups, but untangling
>> that is pretty much impossible because according to the changelog of the
>> commit which introduced this 8 years ago:
>>
>>  3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
>>
>> the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
>> the whole code is built around that.
>>
>> So bite the bullet and invoke the relevant cpuset function, which waits for
>> the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
>> only when tasks are not frozen by suspend/hibernate because that would
>> obviously wait forever.
>>
>> Waiting there with cpu_add_remove_lock, which is protecting the present
>> and possible CPU maps, held is not a problem at all because neither work
>> queues nor cpusets/cgroups have any lockchains related to that lock.
>>
>> Waiting in the hotplug machinery is not problematic either because there
>> are already state callbacks which wait for hardware queues to drain. It
>> makes the operations slightly slower, but hotplug is slow anyway.
>>
>> This ensures that state is consistent before returning from a hotplug
>> up/down operation. It's still inconsistent during the operation, but that's
>> a different story.
>>
>> Add a large comment which explains why this is done and why this is not a
>> dump ground for the hack of the day to work around half thought out locking
>> schemes. Document also the implications vs. hotplug operations and
>> serialization or the lack of it.
>>
>> Thanks to Alexey and Joshua for analyzing why this temporary
>> sched_setaffinity() failure happened.
>>
>> Reported-by: Alexey Klimov <aklimov@redhat.com>
>> Reported-by: Joshua Baker <jobaker@redhat.com>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
>> Cc: Qais Yousef <qais.yousef@arm.com>

Feel free to use:
Tested-by: Alexey Klimov <aklimov@redhat.com>

The bug doesn't reproduce with this change; I had the testcase running
for ~25 hrs without failing under different workloads.
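
For anyone reading along, the user-visible sequence the testcase exercises
is roughly the following (a minimal userspace sketch, not the actual
reproducer; the CPU number is hard-coded and the real test waits for the
"online" uevent instead of relying on the sysfs write returning):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Write a short string to a sysfs attribute. */
	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, val, strlen(val)) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	int main(void)
	{
		const char *online = "/sys/devices/system/cpu/cpu1/online";
		cpu_set_t set;

		/* Offline and online CPU1 via sysfs (needs root). */
		if (write_str(online, "0") || write_str(online, "1"))
			return 1;

		CPU_ZERO(&set);
		CPU_SET(1, &set);
		/*
		 * Before the fix this could fail with EINVAL for a short
		 * window after onlining; with the fix it should succeed.
		 */
		if (sched_setaffinity(0, sizeof(set), &set))
			perror("sched_setaffinity");
		return 0;
	}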

Are you going to submit the patch? Or I can do it on your behalf if you like.

[...]

Best regards,
Alexey


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-06-14 18:14         ` Alexey Klimov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexey Klimov @ 2021-06-14 18:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux Kernel Mailing List, cgroups, Peter Zijlstra, Yury Norov,
	Daniel Jordan, Joshua Baker, audralmitchel, Arnd Bergmann,
	Greg Kroah-Hartman, rafael, tj, Qais Yousef, hannes,
	Alexey Klimov

Thomas,
Just a gentle ping.

On Thu, Apr 15, 2021 at 2:30 AM Alexey Klimov <aklimov@redhat.com> wrote:
>
> On Sun, Apr 4, 2021 at 3:32 AM Alexey Klimov <klimov.linux@gmail.com> wrote:
> >
> > On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:

[...]

> Are you going to submit the patch? Or I can do it on your behalf if you like.

Are you going to send this out to lkml as a separate patch, or do you
want me to do it on your behalf?

Best regards,
Alexey


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining
@ 2021-06-19 20:18           ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-06-19 20:18 UTC (permalink / raw)
  To: Alexey Klimov
  Cc: Linux Kernel Mailing List, cgroups, Peter Zijlstra, Yury Norov,
	Daniel Jordan, Joshua Baker, audralmitchel, Arnd Bergmann,
	Greg Kroah-Hartman, rafael, tj, Qais Yousef, hannes,
	Alexey Klimov

Alexey,

On Mon, Jun 14 2021 at 19:14, Alexey Klimov wrote:
> On Thu, Apr 15, 2021 at 2:30 AM Alexey Klimov <aklimov@redhat.com> wrote:
>>
>> On Sun, Apr 4, 2021 at 3:32 AM Alexey Klimov <klimov.linux@gmail.com> wrote:
>> >
>> > On Sat, Mar 27, 2021 at 9:01 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> [...]
>
>> Are you going to submit the patch? Or I can do it on your behalf if you like.
>
> Are you going to send out this to lkml as a separate patch or do you
> want me to do this on your behalf?

Sorry, I forgot about this completely. No need to resend. It's on LKML
already. I'll pick it up from this thread.

Thanks,

      tglx



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tip: smp/urgent] cpu/hotplug: Cure the cpusets trainwreck
  2021-03-27 21:01   ` Thomas Gleixner
                     ` (2 preceding siblings ...)
  (?)
@ 2021-06-19 20:27   ` tip-bot2 for Thomas Gleixner
  -1 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-06-19 20:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Klimov, Joshua Baker, Thomas Gleixner, stable, x86, linux-kernel

The following commit has been merged into the smp/urgent branch of tip:

Commit-ID:     64c71be97c02c3d3f24dea7c290912ad300538b9
Gitweb:        https://git.kernel.org/tip/64c71be97c02c3d3f24dea7c290912ad300538b9
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 27 Mar 2021 22:01:36 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 19 Jun 2021 22:26:07 +02:00

cpu/hotplug: Cure the cpusets trainwreck

Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexey and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
---
 kernel/cpu.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518..eccc8cf 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -32,6 +32,7 @@
 #include <linux/relay.h>
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/cpuset.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -919,6 +920,52 @@ void clear_tasks_mm_cpumask(int cpu)
 	rcu_read_unlock();
 }
 
+/*
+ *
+ * Serialize hotplug trainwrecks outside of the cpu_hotplug_lock
+ * protected region.
+ *
+ * The operation is still serialized against concurrent CPU hotplug via
+ * cpu_add_remove_lock, i.e. CPU map protection.  But it is _not_
+ * serialized against other hotplug related activity like adding or
+ * removing of state callbacks and state instances, which invoke either the
+ * startup or the teardown callback of the affected state.
+ *
+ * This is required for subsystems which are unfixable vs. CPU hotplug and
+ * evade lock inversion problems by scheduling work which has to be
+ * completed _before_ cpu_up()/_cpu_down() returns.
+ *
+ * Don't even think about adding anything to this for any new code or even
+ * drivers. It's only purpose is to keep existing lock order trainwrecks
+ * working.
+ *
+ * For cpu_down() there might be valid reasons to finish cleanups which are
+ * not required to be done under cpu_hotplug_lock, but that's a different
+ * story and would be not invoked via this.
+ */
+static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
+{
+	/*
+	 * cpusets delegate hotplug operations to a worker to "solve" the
+	 * lock order problems. Wait for the worker, but only if tasks are
+	 * _not_ frozen (suspend, hibernate) as that would wait forever.
+	 *
+	 * The wait is required because otherwise the hotplug operation
+	 * returns with inconsistent state, which could even be observed in
+	 * user space when a new CPU is brought up. The CPU plug uevent
+	 * would be delivered and user space reacting on it would fail to
+	 * move tasks to the newly plugged CPU up to the point where the
+	 * work has finished because up to that point the newly plugged CPU
+	 * is not assignable in cpusets/cgroups. On unplug that's not
+	 * necessarily a visible issue, but it is still inconsistent state,
+	 * which is the real problem which needs to be "fixed". This can't
+	 * prevent the transient state between scheduling the work and
+	 * returning from waiting for it.
+	 */
+	if (!tasks_frozen)
+		cpuset_wait_for_hotplug();
+}
+
 /* Take this CPU down. */
 static int take_cpu_down(void *_param)
 {
@@ -1108,6 +1155,7 @@ out:
 	 */
 	lockup_detector_cleanup();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 
@@ -1302,6 +1350,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 out:
 	cpus_write_unlock();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: smp/urgent] cpu/hotplug: Cure the cpusets trainwreck
  2021-03-27 21:01   ` Thomas Gleixner
                     ` (3 preceding siblings ...)
  (?)
@ 2021-06-21  8:36   ` tip-bot2 for Thomas Gleixner
  -1 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-06-21  8:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alexey Klimov, Joshua Baker, Thomas Gleixner, stable, x86, linux-kernel

The following commit has been merged into the smp/urgent branch of tip:

Commit-ID:     b22afcdf04c96ca58327784e280e10288cfd3303
Gitweb:        https://git.kernel.org/tip/b22afcdf04c96ca58327784e280e10288cfd3303
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 27 Mar 2021 22:01:36 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Mon, 21 Jun 2021 10:31:06 +02:00

cpu/hotplug: Cure the cpusets trainwreck

Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.

cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.

As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.

The same is true for CPU unplug, but that does not create user observable
failure (yet).

It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.

Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:

 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")

the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.

So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.

Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.

Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.

This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.

Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.

Thanks to Alexey and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.

Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
---
 kernel/cpu.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518..d2e1692 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -32,6 +32,7 @@
 #include <linux/relay.h>
 #include <linux/slab.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/cpuset.h>
 
 #include <trace/events/power.h>
 #define CREATE_TRACE_POINTS
@@ -873,6 +874,52 @@ void __init cpuhp_threads_init(void)
 	kthread_unpark(this_cpu_read(cpuhp_state.thread));
 }
 
+/*
+ *
+ * Serialize hotplug trainwrecks outside of the cpu_hotplug_lock
+ * protected region.
+ *
+ * The operation is still serialized against concurrent CPU hotplug via
+ * cpu_add_remove_lock, i.e. CPU map protection.  But it is _not_
+ * serialized against other hotplug related activity like adding or
+ * removing of state callbacks and state instances, which invoke either the
+ * startup or the teardown callback of the affected state.
+ *
+ * This is required for subsystems which are unfixable vs. CPU hotplug and
+ * evade lock inversion problems by scheduling work which has to be
+ * completed _before_ cpu_up()/_cpu_down() returns.
+ *
+ * Don't even think about adding anything to this for any new code or even
+ * drivers. It's only purpose is to keep existing lock order trainwrecks
+ * working.
+ *
+ * For cpu_down() there might be valid reasons to finish cleanups which are
+ * not required to be done under cpu_hotplug_lock, but that's a different
+ * story and would be not invoked via this.
+ */
+static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
+{
+	/*
+	 * cpusets delegate hotplug operations to a worker to "solve" the
+	 * lock order problems. Wait for the worker, but only if tasks are
+	 * _not_ frozen (suspend, hibernate) as that would wait forever.
+	 *
+	 * The wait is required because otherwise the hotplug operation
+	 * returns with inconsistent state, which could even be observed in
+	 * user space when a new CPU is brought up. The CPU plug uevent
+	 * would be delivered and user space reacting on it would fail to
+	 * move tasks to the newly plugged CPU up to the point where the
+	 * work has finished because up to that point the newly plugged CPU
+	 * is not assignable in cpusets/cgroups. On unplug that's not
+	 * necessarily a visible issue, but it is still inconsistent state,
+	 * which is the real problem which needs to be "fixed". This can't
+	 * prevent the transient state between scheduling the work and
+	 * returning from waiting for it.
+	 */
+	if (!tasks_frozen)
+		cpuset_wait_for_hotplug();
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 #ifndef arch_clear_mm_cpumask_cpu
 #define arch_clear_mm_cpumask_cpu(cpu, mm) cpumask_clear_cpu(cpu, mm_cpumask(mm))
@@ -1108,6 +1155,7 @@ out:
 	 */
 	lockup_detector_cleanup();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 
@@ -1302,6 +1350,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 out:
 	cpus_write_unlock();
 	arch_smt_update();
+	cpu_up_down_serialize_trainwrecks(tasks_frozen);
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-06-21  8:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17  0:36 [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining Alexey Klimov
2021-03-17  0:36 ` Alexey Klimov
2021-03-18 19:28 ` Daniel Jordan
2021-03-18 19:28   ` Daniel Jordan
2021-03-27 21:01 ` Thomas Gleixner
2021-03-27 21:01   ` Thomas Gleixner
2021-04-01 14:06   ` Qais Yousef
2021-04-01 14:06     ` Qais Yousef
2021-04-04  2:32   ` Alexey Klimov
2021-04-04  2:32     ` Alexey Klimov
2021-04-15  1:30     ` Alexey Klimov
2021-04-15  1:30       ` Alexey Klimov
2021-06-14 18:14       ` Alexey Klimov
2021-06-14 18:14         ` Alexey Klimov
2021-06-19 20:18         ` Thomas Gleixner
2021-06-19 20:18           ` Thomas Gleixner
2021-06-19 20:27   ` [tip: smp/urgent] cpu/hotplug: Cure the cpusets trainwreck tip-bot2 for Thomas Gleixner
2021-06-21  8:36   ` tip-bot2 for Thomas Gleixner
