* [RFC PATCH v2 0/2] workqueue: destroy_worker() vs isolated CPUs
@ 2022-07-27 11:53 Valentin Schneider
  2022-07-27 11:53 ` [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit() Valentin Schneider
  2022-07-27 11:53 ` [RFC PATCH v2 2/2] DEBUG: workqueue: kworker spawner Valentin Schneider
  0 siblings, 2 replies; 9+ messages in thread
From: Valentin Schneider @ 2022-07-27 11:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

Hi folks,

Using a work struct from within the workqueue code itself is a bit scary, but
I've done more rigorous testing than for v1 and it's looking OK (at the very
least locking-wise).

Note that this affects all kworkers (not just percpu ones) for the sake of
consistency and to prevent adding extra corner cases. kthread_set_per_cpu(p, -1)
is a no-op for unbound kworkers, and IIUC the affinity change is not required
since unbound workers have to be affined to a subset of wq_unbound_cpumask, but
it shouldn't be harmful either.
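
For reference, kthread_set_per_cpu() boils down to roughly the below (quoting
from memory, so treat it as a sketch) - the cpu == -1 case just clears a flag
that unbound workers never have set in the first place:

	void kthread_set_per_cpu(struct task_struct *k, int cpu)
	{
		struct kthread *kthread = to_kthread(k);

		if (!kthread)
			return;

		WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY));

		if (cpu < 0) {
			clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
			return;
		}

		kthread->cpu = cpu;
		set_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
	}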

2/2 is a simple and stupid stresser that forces extra pcpu kworkers to be
spawned on a specific CPU - I can then quickly test this on QEMU by making sure
said CPU is isolated on the cmdline.
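
(For the record, the isolation setup is along the lines of booting the VM with
something like "isolcpus=3 nohz_full=3 rcu_nocbs=3", the CPU number matching
TARGET_CPU in 2/2 - the exact flags here are just an example.)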

Thanks to Tejun & Lai for the discussion thus far.

Revisions
=========

RFCv1 -> RFCv2
++++++++++++++

o Change the pool->timer into a delayed_work to have a sleepable context for
  unbinding kworkers

Cheers,
Valentin

Valentin Schneider (2):
  workqueue: Unbind workers before sending them to exit()
  DEBUG: workqueue: kworker spawner

 kernel/Makefile    |   2 +-
 kernel/workqueue.c | 118 ++++++++++++++++++++++++++++++++++-----------
 kernel/wqstress.c  |  69 ++++++++++++++++++++++++++
 3 files changed, 159 insertions(+), 30 deletions(-)
 create mode 100644 kernel/wqstress.c

--
2.31.1



* [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-27 11:53 [RFC PATCH v2 0/2] workqueue: destroy_worker() vs isolated CPUs Valentin Schneider
@ 2022-07-27 11:53 ` Valentin Schneider
  2022-07-27 17:13   ` Lai Jiangshan
  2022-07-27 11:53 ` [RFC PATCH v2 2/2] DEBUG: workqueue: kworker spawner Valentin Schneider
  1 sibling, 1 reply; 9+ messages in thread
From: Valentin Schneider @ 2022-07-27 11:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.

A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).

Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.

Changing the affinity does require a sleepable context, so get rid of the
pool->idle_timer and use a delayed_work instead.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/workqueue.c | 109 +++++++++++++++++++++++++++++++++------------
 1 file changed, 81 insertions(+), 28 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..27642166dcc5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -167,9 +167,9 @@ struct worker_pool {
 	int			nr_workers;	/* L: total number of workers */
 	int			nr_idle;	/* L: currently idle workers */
 
-	struct list_head	idle_list;	/* L: list of idle workers */
-	struct timer_list	idle_timer;	/* L: worker idle timeout */
-	struct timer_list	mayday_timer;	/* L: SOS timer for workers */
+	struct list_head	idle_list;	  /* L: list of idle workers */
+	struct delayed_work     idle_reaper_work; /* L: worker idle timeout */
+	struct timer_list	mayday_timer;	  /* L: SOS timer for workers */
 
 	/* a workers is either on busy_hash or idle_list, or the manager */
 	DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
@@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &pool->idle_list);
 
-	if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
-		mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
+	if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
+		mod_delayed_work(system_unbound_wq,
+				 &pool->idle_reaper_work,
+				 IDLE_WORKER_TIMEOUT);
 
 	/* Sanity check nr_running. */
 	WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
@@ -1972,9 +1974,29 @@ static struct worker *create_worker(struct worker_pool *pool)
 	return NULL;
 }
 
+static void unbind_worker(struct worker *worker)
+{
+	kthread_set_per_cpu(worker->task, -1);
+	WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
+}
+
+static void rebind_worker(struct worker *worker, struct worker_pool *pool)
+{
+	kthread_set_per_cpu(worker->task, pool->cpu);
+	WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
+}
+
+static void reap_worker(struct worker *worker)
+{
+	list_del_init(&worker->entry);
+	unbind_worker(worker);
+	wake_up_process(worker->task);
+}
+
 /**
  * destroy_worker - destroy a workqueue worker
  * @worker: worker to be destroyed
+ * @list: transfer worker away from its pool->idle_list and into list
  *
  * Destroy @worker and adjust @pool stats accordingly.  The worker should
  * be idle.
@@ -1982,7 +2004,7 @@ static struct worker *create_worker(struct worker_pool *pool)
  * CONTEXT:
  * raw_spin_lock_irq(pool->lock).
  */
-static void destroy_worker(struct worker *worker)
+static void destroy_worker(struct worker *worker, struct list_head *list)
 {
 	struct worker_pool *pool = worker->pool;
 
@@ -1997,34 +2019,64 @@ static void destroy_worker(struct worker *worker)
 	pool->nr_workers--;
 	pool->nr_idle--;
 
-	list_del_init(&worker->entry);
+	list_del(&worker->entry);
 	worker->flags |= WORKER_DIE;
-	wake_up_process(worker->task);
+
+	list_add(&worker->entry, list);
 }
 
-static void idle_worker_timeout(struct timer_list *t)
+/**
+ * idle_reaper_fn - reap workers that have been idle for too long.
+ *
+ * Unbinding marked-for-destruction workers requires a sleepable context, as
+ * changing a task's affinity is not an atomic operation, and we don't want
+ * to disturb isolated CPUs IDLE_WORKER_TIMEOUT in the future just for a kworker
+ * to do_exit().
+ *
+ * Percpu kworkers should meet the conditions for the affinity change to not
+ * block (not migration-disabled and not running), but there is no *hard*
+ * guarantee that they are not running when we get here.
+ *
+ * The delayed_work is only ever modified under raw_spin_lock_irq(pool->lock).
+ */
+static void idle_reaper_fn(struct work_struct *work)
 {
-	struct worker_pool *pool = from_timer(pool, t, idle_timer);
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct worker_pool *pool = container_of(dwork, struct worker_pool, idle_reaper_work);
+	struct list_head reaplist;
+	struct worker *worker, *tmp;
+
+	INIT_LIST_HEAD(&reaplist);
 
 	raw_spin_lock_irq(&pool->lock);
 
 	while (too_many_workers(pool)) {
-		struct worker *worker;
 		unsigned long expires;
+		unsigned long now = jiffies;
 
 		/* idle_list is kept in LIFO order, check the last one */
 		worker = list_entry(pool->idle_list.prev, struct worker, entry);
 		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
 
-		if (time_before(jiffies, expires)) {
-			mod_timer(&pool->idle_timer, expires);
+		/*
+		 * Careful: queueing a work item from here can and will cause a
+		 * self-deadlock when dealing with an unbound pool. However,
+		 * here the delay *cannot* be zero and *has* to be in the
+		 * future, which works.
+		 */
+		if (time_before(now, expires)) {
+			mod_delayed_work(system_unbound_wq,
+					 &pool->idle_reaper_work,
+					 expires - now);
 			break;
 		}
 
-		destroy_worker(worker);
+		destroy_worker(worker, &reaplist);
 	}
-
 	raw_spin_unlock_irq(&pool->lock);
+
+	list_for_each_entry_safe(worker, tmp, &reaplist, entry)
+		reap_worker(worker);
 }
 
 static void send_mayday(struct work_struct *work)
@@ -3454,7 +3506,7 @@ static int init_worker_pool(struct worker_pool *pool)
 	INIT_LIST_HEAD(&pool->idle_list);
 	hash_init(pool->busy_hash);
 
-	timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);
+	INIT_DEFERRABLE_WORK(&pool->idle_reaper_work, idle_reaper_fn);
 
 	timer_setup(&pool->mayday_timer, pool_mayday_timeout, 0);
 
@@ -3559,7 +3611,10 @@ static bool wq_manager_inactive(struct worker_pool *pool)
 static void put_unbound_pool(struct worker_pool *pool)
 {
 	DECLARE_COMPLETION_ONSTACK(detach_completion);
-	struct worker *worker;
+	struct list_head dlist;
+	struct worker *worker, *tmp;
+
+	INIT_LIST_HEAD(&dlist);
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3588,10 +3643,13 @@ static void put_unbound_pool(struct worker_pool *pool)
 	pool->flags |= POOL_MANAGER_ACTIVE;
 
 	while ((worker = first_idle_worker(pool)))
-		destroy_worker(worker);
+		destroy_worker(worker, &dlist);
 	WARN_ON(pool->nr_workers || pool->nr_idle);
 	raw_spin_unlock_irq(&pool->lock);
 
+	list_for_each_entry_safe(worker, tmp, &dlist, entry)
+		reap_worker(worker);
+
 	mutex_lock(&wq_pool_attach_mutex);
 	if (!list_empty(&pool->workers))
 		pool->detach_completion = &detach_completion;
@@ -3601,7 +3659,7 @@ static void put_unbound_pool(struct worker_pool *pool)
 		wait_for_completion(pool->detach_completion);
 
 	/* shut down the timers */
-	del_timer_sync(&pool->idle_timer);
+	cancel_delayed_work_sync(&pool->idle_reaper_work);
 	del_timer_sync(&pool->mayday_timer);
 
 	/* RCU protected to allow dereferences from get_work_pool() */
@@ -4999,10 +5057,8 @@ static void unbind_workers(int cpu)
 
 		raw_spin_unlock_irq(&pool->lock);
 
-		for_each_pool_worker(worker, pool) {
-			kthread_set_per_cpu(worker->task, -1);
-			WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
-		}
+		for_each_pool_worker(worker, pool)
+			unbind_worker(worker);
 
 		mutex_unlock(&wq_pool_attach_mutex);
 	}
@@ -5027,11 +5083,8 @@ static void rebind_workers(struct worker_pool *pool)
 	 * of all workers first and then clear UNBOUND.  As we're called
 	 * from CPU_ONLINE, the following shouldn't fail.
 	 */
-	for_each_pool_worker(worker, pool) {
-		kthread_set_per_cpu(worker->task, pool->cpu);
-		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
-						  pool->attrs->cpumask) < 0);
-	}
+	for_each_pool_worker(worker, pool)
+		rebind_worker(worker, pool);
 
 	raw_spin_lock_irq(&pool->lock);
 
-- 
2.31.1



* [RFC PATCH v2 2/2] DEBUG: workqueue: kworker spawner
  2022-07-27 11:53 [RFC PATCH v2 0/2] workqueue: destroy_worker() vs isolated CPUs Valentin Schneider
  2022-07-27 11:53 ` [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit() Valentin Schneider
@ 2022-07-27 11:53 ` Valentin Schneider
  1 sibling, 0 replies; 9+ messages in thread
From: Valentin Schneider @ 2022-07-27 11:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tejun Heo, Lai Jiangshan, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

---
 kernel/Makefile    |  2 +-
 kernel/workqueue.c |  9 +++++-
 kernel/wqstress.c  | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+), 2 deletions(-)
 create mode 100644 kernel/wqstress.c

diff --git a/kernel/Makefile b/kernel/Makefile
index a7e1f49ab2b3..860133f7bca5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    extable.o params.o platform-feature.o \
 	    kthread.o sys_ni.o nsproxy.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
-	    async.o range.o smpboot.o ucount.o regset.o
+	    async.o range.o smpboot.o ucount.o regset.o wqstress.o
 
 obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
 obj-$(CONFIG_MODULES) += kmod.o
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 27642166dcc5..9559d0256683 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -91,7 +91,7 @@ enum {
 	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */
 
 	MAX_IDLE_WORKERS_RATIO	= 4,		/* 1/4 of busy can be idle */
-	IDLE_WORKER_TIMEOUT	= 300 * HZ,	/* keep idle ones for 5 mins */
+	IDLE_WORKER_TIMEOUT	= 3 * HZ,	/* DEBUG: keep idle ones for 3s */
 
 	MAYDAY_INITIAL_TIMEOUT  = HZ / 100 >= 2 ? HZ / 100 : 2,
 						/* call for help after 10ms
@@ -1988,6 +1988,10 @@ static void rebind_worker(struct worker *worker, struct worker_pool *pool)
 
 static void reap_worker(struct worker *worker)
 {
+	pr_info("WORKER_REAP: task=%s cpu=%d this_task=%s this_cpu=%d\n",
+		worker->task->comm, task_cpu(worker->task),
+		current->comm, raw_smp_processor_id());
+
 	list_del_init(&worker->entry);
 	unbind_worker(worker);
 	wake_up_process(worker->task);
@@ -2443,6 +2447,9 @@ static int worker_thread(void *__worker)
 		WARN_ON_ONCE(!list_empty(&worker->entry));
 		set_pf_worker(false);
 
+		pr_info("WORKER_DIE: task=%s this_cpu=%d\n",
+			current->comm, raw_smp_processor_id());
+
 		set_task_comm(worker->task, "kworker/dying");
 		ida_free(&pool->worker_ida, worker->id);
 		worker_detach_from_pool(worker);
diff --git a/kernel/wqstress.c b/kernel/wqstress.c
new file mode 100644
index 000000000000..16a3771027cd
--- /dev/null
+++ b/kernel/wqstress.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+
+MODULE_AUTHOR("Valentin Schneider <vschneid@redhat.com>");
+MODULE_LICENSE("GPL");
+
+#define TARGET_CPU 3
+
+static void wqstress_workfn(struct work_struct *work)
+{
+	schedule_timeout_interruptible(10 * HZ);
+}
+
+#define DECL_WORK(n) static DECLARE_WORK(wqstress_work_##n, wqstress_workfn)
+#define KICK_WORK(n) do {						\
+		schedule_work_on(TARGET_CPU, &wqstress_work_##n);	\
+	} while (0);
+#define FLUSH_WORK(n) do {			\
+		flush_work(&wqstress_work_##n);	\
+	} while (0);
+
+DECL_WORK(0);
+DECL_WORK(1);
+DECL_WORK(2);
+DECL_WORK(3);
+DECL_WORK(4);
+DECL_WORK(5);
+DECL_WORK(6);
+DECL_WORK(7);
+DECL_WORK(8);
+DECL_WORK(9);
+
+/*
+ * This should create ≈(N-1) extra kworkers for N kicked work
+ */
+static int __init wqstress_init(void)
+{
+	pr_info("WQSTRESS START\n");
+
+	sched_set_fifo_low(current);
+
+	KICK_WORK(0);
+	KICK_WORK(1);
+	KICK_WORK(2);
+	KICK_WORK(3);
+	KICK_WORK(4);
+	KICK_WORK(5);
+	KICK_WORK(6);
+	KICK_WORK(7);
+	KICK_WORK(8);
+	KICK_WORK(9);
+
+	FLUSH_WORK(0);
+	FLUSH_WORK(1);
+	FLUSH_WORK(2);
+	FLUSH_WORK(3);
+	FLUSH_WORK(4);
+	FLUSH_WORK(5);
+	FLUSH_WORK(6);
+	FLUSH_WORK(7);
+	FLUSH_WORK(8);
+	FLUSH_WORK(9);
+
+	return 0;
+}
+
+late_initcall_sync(wqstress_init);
-- 
2.31.1



* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-27 11:53 ` [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit() Valentin Schneider
@ 2022-07-27 17:13   ` Lai Jiangshan
  2022-07-28 10:54     ` Valentin Schneider
  0 siblings, 1 reply; 9+ messages in thread
From: Lai Jiangshan @ 2022-07-27 17:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: LKML, Tejun Heo, Peter Zijlstra, Frederic Weisbecker, Juri Lelli,
	Phil Auld, Marcelo Tosatti

Quick review before going to sleep.

On Wed, Jul 27, 2022 at 7:54 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> It has been reported that isolated CPUs can suffer from interference due to
> per-CPU kworkers waking up just to die.
>
> A surge of workqueue activity during initial setup of a latency-sensitive
> application (refresh_vm_stats() being one of the culprits) can cause extra
> per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
> running merrily on an isolated CPU only to be interrupted sometime later by
> a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
> kworker activity).
>
> Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
> contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
> them with WORKER_DIE.
>
> Changing the affinity does require a sleepable context, so get rid of the
> pool->idle_timer and use a delayed_work instead.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/workqueue.c | 109 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 81 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..27642166dcc5 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -167,9 +167,9 @@ struct worker_pool {
>         int                     nr_workers;     /* L: total number of workers */
>         int                     nr_idle;        /* L: currently idle workers */
>
> -       struct list_head        idle_list;      /* L: list of idle workers */
> -       struct timer_list       idle_timer;     /* L: worker idle timeout */
> -       struct timer_list       mayday_timer;   /* L: SOS timer for workers */
> +       struct list_head        idle_list;        /* L: list of idle workers */
> +       struct delayed_work     idle_reaper_work; /* L: worker idle timeout */
> +       struct timer_list       mayday_timer;     /* L: SOS timer for workers */
>
>         /* a workers is either on busy_hash or idle_list, or the manager */
>         DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
> @@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
>         /* idle_list is LIFO */
>         list_add(&worker->entry, &pool->idle_list);
>
> -       if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
> -               mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
> +       if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
> +               mod_delayed_work(system_unbound_wq,
> +                                &pool->idle_reaper_work,
> +                                IDLE_WORKER_TIMEOUT);

system_unbound_wq doesn't have a rescuer.

A new workqueue with a rescuer needs to be created and used for
this purpose.

>
>         /* Sanity check nr_running. */
>         WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
> @@ -1972,9 +1974,29 @@ static struct worker *create_worker(struct worker_pool *pool)
>         return NULL;
>  }
>
> +static void unbind_worker(struct worker *worker)
> +{
> +       kthread_set_per_cpu(worker->task, -1);
> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> +}
> +
> +static void rebind_worker(struct worker *worker, struct worker_pool *pool)
> +{
> +       kthread_set_per_cpu(worker->task, pool->cpu);
> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
> +}
> +
> +static void reap_worker(struct worker *worker)
> +{
> +       list_del_init(&worker->entry);
> +       unbind_worker(worker);
> +       wake_up_process(worker->task);


Since WORKER_DIE is set, the worker can be possibly freed now
if there is another source to wake it up.
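
The problematic window is roughly (a sketch, exact interleaving aside):

  idle_reaper_fn()                        dying kworker (sleeping)
  ----------------                        ------------------------
  destroy_worker(worker, &reaplist)
    // WORKER_DIE set, moved to reaplist
  raw_spin_unlock_irq(&pool->lock)
                                          woken by some other/spurious wakeup
                                          worker_thread() sees WORKER_DIE
                                          worker_detach_from_pool()
                                          kfree(worker)
  reap_worker(worker)
    list_del_init(&worker->entry)  <-- use-after-free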

I think reverting a part of the commit 60f5a4bcf852("workqueue:
async worker destruction") to make use of kthread_stop()
in destroy_worker() should be a good idea.


* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-27 17:13   ` Lai Jiangshan
@ 2022-07-28 10:54     ` Valentin Schneider
  2022-07-28 16:35       ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Valentin Schneider @ 2022-07-28 10:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Tejun Heo, Peter Zijlstra, Frederic Weisbecker, Juri Lelli,
	Phil Auld, Marcelo Tosatti

On 28/07/22 01:13, Lai Jiangshan wrote:
> Quick review before going to sleep.
>

Thanks!

> On Wed, Jul 27, 2022 at 7:54 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> @@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
>>         /* idle_list is LIFO */
>>         list_add(&worker->entry, &pool->idle_list);
>>
>> -       if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
>> -               mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
>> +       if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
>> +               mod_delayed_work(system_unbound_wq,
>> +                                &pool->idle_reaper_work,
>> +                                IDLE_WORKER_TIMEOUT);
>
> system_unbound_wq doesn't have a rescuer.
>
> A new workqueue with a rescuer needs to be created and used for
> this purpose.
>

Right, I think it makes sense for those work items to be attached to a
WQ_MEM_RECLAIM workqueue. Should I add that as a workqueue-internal
thing?
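
(Concretely I was thinking of something along the lines of a workqueue-internal

	static struct workqueue_struct *reaper_wq;

	/* somewhere in workqueue_init(): */
	reaper_wq = alloc_workqueue("idle_reaper", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

and queueing pool->idle_reaper_work on that instead of system_unbound_wq -
just a sketch of the idea, not something I've tried yet.)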

>>
>>         /* Sanity check nr_running. */
>>         WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
>> @@ -1972,9 +1974,29 @@ static struct worker *create_worker(struct worker_pool *pool)
>>         return NULL;
>>  }
>>
>> +static void unbind_worker(struct worker *worker)
>> +{
>> +       kthread_set_per_cpu(worker->task, -1);
>> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
>> +}
>> +
>> +static void rebind_worker(struct worker *worker, struct worker_pool *pool)
>> +{
>> +       kthread_set_per_cpu(worker->task, pool->cpu);
>> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
>> +}
>> +
>> +static void reap_worker(struct worker *worker)
>> +{
>> +       list_del_init(&worker->entry);
>> +       unbind_worker(worker);
>> +       wake_up_process(worker->task);
>
>
> Since WORKER_DIE is set, the worker can be possibly freed now
> if there is another source to wake it up.
>

My understanding for having reap_worker() be "safe" to use outside of
raw_spin_lock_irq(pool->lock) is that pool->idle_list is never accessed
outside of the pool->lock, and wake_up_worker() only wakes a worker that
is in that list. So with destroy_worker() detaching the worker from
pool->idle_list under pool->lock, I'm not aware of a codepath other than
reap_worker() that could wake it up.

The only wake_up_process() I see that doesn't involve the pool->idle_list
is in send_mayday(), but AFAIA rescuers can never end up in the idle_list
and are specifically destroyed in destroy_workqueue().

> I think reverting a part of the commit 60f5a4bcf852("workqueue:
> async worker destruction") to make use of kthread_stop()
> in destroy_worker() should be a good idea.



* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-28 10:54     ` Valentin Schneider
@ 2022-07-28 16:35       ` Tejun Heo
  2022-07-28 17:24         ` Valentin Schneider
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2022-07-28 16:35 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Lai Jiangshan, LKML, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

On Thu, Jul 28, 2022 at 11:54:19AM +0100, Valentin Schneider wrote:
> On 28/07/22 01:13, Lai Jiangshan wrote:
> > Quick review before going to sleep.
> >
> 
> Thanks!
> 
> > On Wed, Jul 27, 2022 at 7:54 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> @@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
> >>         /* idle_list is LIFO */
> >>         list_add(&worker->entry, &pool->idle_list);
> >>
> >> -       if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
> >> -               mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
> >> +       if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
> >> +               mod_delayed_work(system_unbound_wq,
> >> +                                &pool->idle_reaper_work,
> >> +                                IDLE_WORKER_TIMEOUT);
> >
> > system_unbound_wq doesn't have a rescuer.
> >
> > A new workqueue with a rescuer needs to be created and used for
> > this purpose.
> >
> 
> Right, I think it makes sense for those work items to be attached to a
> WQ_MEM_RECLAIM workqueue. Should I add that as a workqueue-internal
> thing?

I don't understand why this would need MEM_RECLAIM when it isn't sitting in
the memory reclaim path. Nothing in mm side can wait on this.

> > Since WORKER_DIE is set, the worker can be possibly freed now
> > if there is another source to wake it up.
> >
> 
> My understanding for having reap_worker() be "safe" to use outside of
> raw_spin_lock_irq(pool->lock) is that pool->idle_list is never accessed
> outside of the pool->lock, and wake_up_worker() only wakes a worker that
> is in that list. So with destroy_worker() detaching the worker from
> pool->idle_list under pool->lock, I'm not aware of a codepath other than
> reap_worker() that could wake it up.

There actually are spurious wakeups. We can't depend on there being no
wakeups other than ours.

Thanks.

-- 
tejun


* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-28 16:35       ` Tejun Heo
@ 2022-07-28 17:24         ` Valentin Schneider
  2022-07-28 17:31           ` Tejun Heo
  0 siblings, 1 reply; 9+ messages in thread
From: Valentin Schneider @ 2022-07-28 17:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, LKML, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

On 28/07/22 06:35, Tejun Heo wrote:
> On Thu, Jul 28, 2022 at 11:54:19AM +0100, Valentin Schneider wrote:
>> On 28/07/22 01:13, Lai Jiangshan wrote:
>> > system_unbound_wq doesn't have a rescuer.
>> >
>> > A new workqueue with a rescuer needs to be created and used for
>> > this purpose.
>> >
>>
>> Right, I think it makes sense for those work items to be attached to a
>> WQ_MEM_RECLAIM workqueue. Should I add that as a workqueue-internal
>> thing?
>
> I don't understand why this would need MEM_RECLAIM when it isn't sitting in
> the memory reclaim path. Nothing in mm side can wait on this.
>

Vaguely reading the doc I thought that'd be for anything that would
directly or indirectly help with reclaiming memory (not explicitly sitting
in some *mm reclaim* path), and I assumed freeing up a worker would count as
that - but that's the understanding of someone who doesn't know much about
all that :-)

>> > Since WORKER_DIE is set, the worker can be possibly freed now
>> > if there is another source to wake it up.
>> >
>>
>> My understanding for having reap_worker() be "safe" to use outside of
>> raw_spin_lock_irq(pool->lock) is that pool->idle_list is never accessed
>> outside of the pool->lock, and wake_up_worker() only wakes a worker that
>> is in that list. So with destroy_worker() detaching the worker from
>> pool->idle_list under pool->lock, I'm not aware of a codepath other than
>> reap_worker() that could wake it up.
>
> There actually are spurious wakeups. We can't depend on there being no
> wakeups other than ours.
>

Myes, I suppose if a to-be-destroyed kworker spuriously wakes before having
been unbound then there's not much point in having the unbinding (harm has
been done and the kworker can do_exit(), though arguably we could reduce
the harm and still move it away), but let me see what I can do here.

> Thanks.
>
> --
> tejun



* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-28 17:24         ` Valentin Schneider
@ 2022-07-28 17:31           ` Tejun Heo
  2022-07-29 10:12             ` Valentin Schneider
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2022-07-28 17:31 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Lai Jiangshan, LKML, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

Hello,

On Thu, Jul 28, 2022 at 06:24:17PM +0100, Valentin Schneider wrote:
> > I don't understand why this would need MEM_RECLAIM when it isn't sitting in
> > the memory reclaim path. Nothing in mm side can wait on this.
> 
> Vaguely reading the doc I thought that'd be for anything that would
> directly or indirectly help with reclaiming memory (not explicitly sitting
> in some *mm reclaim* path), and I assumed freeing up a worker would count as
> that - but that's the understanding of someone who doesn't know much about
> all that :-)

Oh, it's just needed for things that mm might end up waiting on. Here,
there's no way for mm to know about or trigger this at all, so it doesn't
need the flag.

> > There actually are spurious wakeups. We can't depend on there being no
>> > wakeups other than ours.
> 
> Myes, I suppose if a to-be-destroyed kworker spuriously wakes before having
> been unbound then there's not much point in having the unbinding (harm has
> been done and the kworker can do_exit(), though arguably we could reduce
> the harm and still move it away), but let me see what I can do here.

Yeah, it kinda sucks but is a kernel-wide thing and pretty rare, so for the
most part, we can pretend that they don't exist but under specific
conditions, there can be asynchronous wakeups coming from wherever, so we
gotta be crash proof against those.

Thanks.

-- 
tejun


* Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
  2022-07-28 17:31           ` Tejun Heo
@ 2022-07-29 10:12             ` Valentin Schneider
  0 siblings, 0 replies; 9+ messages in thread
From: Valentin Schneider @ 2022-07-29 10:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, LKML, Peter Zijlstra, Frederic Weisbecker,
	Juri Lelli, Phil Auld, Marcelo Tosatti

On 28/07/22 07:31, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 06:24:17PM +0100, Valentin Schneider wrote:
>> > I don't understand why this would need MEM_RECLAIM when it isn't sitting in
>> > the memory reclaim path. Nothing in mm side can wait on this.
>>
>> Vaguely reading the doc I thought that'd be for anything that would
>> directly or indirectly help with reclaiming memory (not explicitly sitting
>> in some *mm reclaim* path), and I assumed freeing up a worker would count as
>> that - but that's the understanding of someone who doesn't know much about
>> all that :-)
>
> Oh, it's just needed for things that mm might end up waiting on. Here,
> there's no way for mm to know about or trigger this at all, so it doesn't
> need the flag.
>

Got it, thanks!

>> > There actually are spurious wakeups. We can't depend on there being no
>> > wakeups other than ours.
>>
>> Myes, I suppose if a to-be-destroyed kworker spuriously wakes before having
>> been unbound then there's not much point in having the unbinding (harm has
>> been done and the kworker can do_exit(), though arguably we could reduce
>> the harm and still move it away), but let me see what I can do here.
>
> Yeah, it kinda sucks but is a kernel-wide thing and pretty rare, so for the
> most part, we can pretend that they don't exist but under specific
> conditions, there can be asynchronous wakeups coming from wherever, so we
> gotta be crash proof against those.
>

That's sensible, I'll look into Lai's suggestion and see if I can come up
with something not-too-horrible.
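
(Very roughly, the direction I understand Lai to be suggesting - a sketch only,
not a tested patch - is to make the reaping synchronous again:

	static void reap_worker(struct worker *worker)
	{
		/* already off pool->idle_list, WORKER_DIE set */
		unbind_worker(worker);
		/* wakes the task and waits for it to exit */
		kthread_stop(worker->task);
		kfree(worker);
	}

with worker_thread() exiting when it sees WORKER_DIE but no longer detaching
and freeing itself, closer to the pre-60f5a4bcf852 behaviour, so that a stray
wakeup can't free the worker under us.)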



