linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups
@ 2022-08-04  8:41 Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool() Lai Jiangshan
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Linus Torvalds, Eric W. Biederman, Tejun Heo,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Patches 1-3 are fixes for premature wakeups and patches 4-8 are cleanups.

Patch 2 fixes the case where a premature wakeup happens after kthread_bind_mask().
Patch 3 fixes the case where a premature wakeup happens before kthread_bind_mask().
Patch 1 prepares for patches 2-3.

Like Petr's patch [1], a completion is introduced to do the synchronization,
but the synchronization is done in the other direction, which allows the
newly created worker to do some of its own initialization instead of the
manager, and allows for simpler code.
(The changed synchronization direction is not necessarily better.)

This also makes the workqueue code less dependent on the semantics that
kthread provides.
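
To illustrate the difference in direction, a rough sketch (variant A is only
assumed from the description of [1] rather than copied from it, and the
completion names are made up; variant B follows patch 2 in simplified form):

  /* Variant A: the manager initializes, the new worker waits. */
  create_worker():
      kthread_create_on_node(worker_thread, ...);
      worker_attach_to_pool(worker, pool);
      worker_enter_idle(worker);
      wake_up_process(worker->task);
      complete(&worker->ready);
  worker_thread():
      wait_for_completion(&worker->ready);
      /* ... run as usual ... */

  /* Variant B (this series): the worker initializes itself, the manager waits. */
  create_worker():
      kthread_create_on_node(worker_thread, ...);
      wake_up_process(worker->task);
      wait_for_completion(&pool->created);
  worker_thread():
      worker_attach_to_pool(worker, pool);
      worker->pool->nr_workers++;
      worker_enter_idle(worker);
      complete(&pool->created);
      /* ... run as usual ... */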


[1]: https://lore.kernel.org/all/20220622140853.31383-1-pmladek@suse.com/
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wedson Almeida Filho <wedsonaf@google.com>

Lai Jiangshan (8):
  workqueue: Unconditionally set cpumask in worker_attach_to_pool()
  workqueue: Make create_worker() safe against premature wakeups
  workqueue: Set PF_NO_SETAFFINITY instead of kthread_bind_mask()
  workqueue: Set/Clear PF_WQ_WORKER while attaching/detaching
  workqueue: Use worker_set_flags() in worker_enter_idle()
  workqueue: Simplify the starting of the newly created worker
  workqueue: Remove the outer loop in maybe_create_worker()
  workqueue: Move the locking out of maybe_create_worker()

 kernel/workqueue.c          | 123 +++++++++++++++---------------------
 kernel/workqueue_internal.h |  11 +++-
 2 files changed, 60 insertions(+), 74 deletions(-)

-- 
2.19.1.6.gb485710b



* [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool()
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-16 21:18   ` Tejun Heo
  2022-09-12  7:54   ` Peter Zijlstra
  2022-08-04  8:41 ` [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups Lai Jiangshan
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Linus Torvalds, Eric W. Biederman, Tejun Heo,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Lai Jiangshan, Valentin Schneider

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

If a worker is spuriously woken up after kthread_bind_mask() but before
worker_attach_to_pool(), and a CPU hot-[un]plug happens during the same
interval, the worker task might be pushed away from its bound CPU by the
scheduler with its affinity changed, and worker_attach_to_pool() does not
rebind it properly.

Set the affinity unconditionally in worker_attach_to_pool() to fix the
problem.

This also prepares for moving worker_attach_to_pool() from create_worker()
to the start of worker_thread(), which will make the said interval exist
even without a spurious wakeup.
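
A rough timeline of the race (an illustrative interleaving assumed from the
description above, not a captured trace):

  manager: create_worker()        new worker task          CPU hotplug
  ------------------------        ---------------          -----------
  kthread_bind_mask(task,
      pool->attrs->cpumask)
                                  spurious wakeup; starts
                                  running on a CPU in the
                                  pool's cpumask
                                                           that CPU goes offline;
                                                           the scheduler pushes the
                                                           task away and changes
                                                           its affinity
  worker_attach_to_pool()
      worker->rescue_wq is NULL,
      so set_cpus_allowed_ptr()
      used to be skipped here and
      the stale affinity remained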

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wedson Almeida Filho <wedsonaf@google.com>
Fixes: 640f17c82460 ("workqueue: Restrict affinity change to rescuer")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..928aad7d6123 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1872,8 +1872,11 @@ static void worker_attach_to_pool(struct worker *worker,
 	else
 		kthread_set_per_cpu(worker->task, pool->cpu);
 
-	if (worker->rescue_wq)
-		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	/*
+	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
+	 */
+	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
 
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
-- 
2.19.1.6.gb485710b



* [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool() Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
       [not found]   ` <20220804123520.1660-1-hdanton@sina.com>
  2022-08-16 21:46   ` Tejun Heo
  2022-08-04  8:41 ` [RFC PATCH 3/8] workqueue: Set PF_NO_SETAFFINITY instead of kthread_bind_mask() Lai Jiangshan
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Linus Torvalds, Eric W. Biederman, Tejun Heo,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

A system crashed with the following BUG() report:

  [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
  [115147.050488] #PF: supervisor write access in kernel mode
  [115147.050489] #PF: error_code(0x0002) - not-present page
  [115147.050490] PGD 0 P4D 0
  [115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
  [115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G           O   X    5.3.18-2-rt #1 SLE15-SP2 (unreleased)
  [115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
  [115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
  [115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
  [115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  [115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
  [115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
  [115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
  [115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
  [115147.050521] FS:  0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
  [115147.050523] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
  [115147.050524] Call Trace:
  [115147.050533]  worker_thread+0xb4/0x3c0
  [115147.050538]  ? process_one_work+0x4a0/0x4a0
  [115147.050540]  kthread+0x152/0x170
  [115147.050542]  ? kthread_park+0xa0/0xa0
  [115147.050544]  ret_from_fork+0x35/0x40

Further debugging showed that the worker thread was woken up before
worker_attach_to_pool() finished in create_worker(), though the reason
why it was woken up is still unknown; it might be some real-time kernel
activity.

Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
until it is explicitly woken. But a spurious wakeup might
break this expectation.

As a result, worker_thread() might read worker->pool before it is set
by worker_attach_to_pool() in create_worker().  Or it might call
worker_leave_idle() before worker_enter_idle() has been called, or
process work items before being attached to the pool.

Also, manage_workers() might want to create yet another worker before
worker->pool->nr_workers is updated.  It is kind of a chicken & egg
problem.

Synchronize these operations using a completion API.  There are two
ways to do the synchronization: either the manager does the worker
initialization and the newly created worker waits for it to complete,
or the newly created worker does its own initialization and the manager
waits for it to complete.

In the current code, the manager does the worker initialization,
depending on the kthread API keeping the worker in TASK_UNINTERRUPTIBLE.

That guarantee is fragile, so one of the two ways of synchronization
should be chosen explicitly and the dependence should be avoided.

Having the newly created worker do its own initialization simplifies
the code further, so the second way is chosen.

Note that worker->pool might then be read without wq_pool_attach_mutex.
A normal worker always belongs to the same pool, and the locking rules
for it are updated a bit.

Also note that rescuer_thread() does not need this because all the
needed values are set before its kthread is created.  A rescuer is
tied to a particular workqueue and is attached to different pools as
needed.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c          | 22 ++++++++++++++--------
 kernel/workqueue_internal.h | 11 +++++++++--
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 928aad7d6123..f5b12c6778cc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -176,6 +176,7 @@ struct worker_pool {
 						/* L: hash of busy workers */
 
 	struct worker		*manager;	/* L: purely informational */
+	struct completion	created;	/* create_worker(): worker created */
 	struct list_head	workers;	/* A: attached workers */
 	struct completion	*detach_completion; /* all workers detached */
 
@@ -1942,6 +1943,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 		goto fail;
 
 	worker->id = id;
+	worker->pool = pool;
 
 	if (pool->cpu >= 0)
 		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
@@ -1949,6 +1951,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
+	reinit_completion(&pool->created);
 	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
 					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
@@ -1957,15 +1960,9 @@ static struct worker *create_worker(struct worker_pool *pool)
 	set_user_nice(worker->task, pool->attrs->nice);
 	kthread_bind_mask(worker->task, pool->attrs->cpumask);
 
-	/* successful, attach the worker to the pool */
-	worker_attach_to_pool(worker, pool);
-
 	/* start the newly created worker */
-	raw_spin_lock_irq(&pool->lock);
-	worker->pool->nr_workers++;
-	worker_enter_idle(worker);
 	wake_up_process(worker->task);
-	raw_spin_unlock_irq(&pool->lock);
+	wait_for_completion(&pool->created);
 
 	return worker;
 
@@ -2383,10 +2380,17 @@ static int worker_thread(void *__worker)
 	struct worker *worker = __worker;
 	struct worker_pool *pool = worker->pool;
 
+	/* attach the worker to the pool */
+	worker_attach_to_pool(worker, pool);
+
 	/* tell the scheduler that this is a workqueue worker */
 	set_pf_worker(true);
-woke_up:
+
 	raw_spin_lock_irq(&pool->lock);
+	worker->pool->nr_workers++;
+	worker_enter_idle(worker);
+	complete(&pool->created);
+woke_up:
 
 	/* am I supposed to die? */
 	if (unlikely(worker->flags & WORKER_DIE)) {
@@ -2458,6 +2462,7 @@ static int worker_thread(void *__worker)
 	__set_current_state(TASK_IDLE);
 	raw_spin_unlock_irq(&pool->lock);
 	schedule();
+	raw_spin_lock_irq(&pool->lock);
 	goto woke_up;
 }
 
@@ -3461,6 +3466,7 @@ static int init_worker_pool(struct worker_pool *pool)
 
 	timer_setup(&pool->mayday_timer, pool_mayday_timeout, 0);
 
+	init_completion(&pool->created);
 	INIT_LIST_HEAD(&pool->workers);
 
 	ida_init(&pool->worker_ida);
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..025861c4d1f6 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -37,8 +37,15 @@ struct worker {
 	/* 64 bytes boundary on 64bit, 32 on 32bit */
 
 	struct task_struct	*task;		/* I: worker task */
-	struct worker_pool	*pool;		/* A: the associated pool */
-						/* L: for rescuers */
+
+	/*
+	 * The associated pool, locking rules:
+	 *   PF_WQ_WORKER: from the current worker
+	 *   PF_WQ_WORKER && wq_pool_attach_mutex: from remote tasks
+	 *   None: from the current worker when the worker is coming up
+	 */
+	struct worker_pool	*pool;
+
 	struct list_head	node;		/* A: anchored at pool->workers */
 						/* A: runs through worker->node */
 
-- 
2.19.1.6.gb485710b



* [RFC PATCH 3/8] workqueue: Set PF_NO_SETAFFINITY instead of kthread_bind_mask()
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool() Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 4/8] workqueue: Set/Clear PF_WQ_WORKER while attaching/detaching Lai Jiangshan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Linus Torvalds, Eric W. Biederman, Tejun Heo,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

kthread_bind_mask() can't work correctly if a spurious wakeup happens
before it is called.

And the cpumask of a spuriously woken worker could be changed by
userspace if worker_attach_to_pool() is called earlier than
kthread_bind_mask().

To avoid the problems caused by spurious wakeups, set PF_NO_SETAFFINITY
at the start of the workers, where kthread_bind_mask() can't be used.
Luckily, the workqueue code binds the cpumask by itself, so
PF_NO_SETAFFINITY is all it needs from kthread_bind_mask().

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f5b12c6778cc..82937c0fb21f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1958,7 +1958,6 @@ static struct worker *create_worker(struct worker_pool *pool)
 		goto fail;
 
 	set_user_nice(worker->task, pool->attrs->nice);
-	kthread_bind_mask(worker->task, pool->attrs->cpumask);
 
 	/* start the newly created worker */
 	wake_up_process(worker->task);
@@ -2380,6 +2379,8 @@ static int worker_thread(void *__worker)
 	struct worker *worker = __worker;
 	struct worker_pool *pool = worker->pool;
 
+	current->flags |= PF_NO_SETAFFINITY;
+
 	/* attach the worker to the pool */
 	worker_attach_to_pool(worker, pool);
 
@@ -2494,6 +2495,7 @@ static int rescuer_thread(void *__rescuer)
 	struct list_head *scheduled = &rescuer->scheduled;
 	bool should_stop;
 
+	current->flags |= PF_NO_SETAFFINITY;
 	set_user_nice(current, RESCUER_NICE_LEVEL);
 
 	/*
@@ -4279,7 +4281,6 @@ static int init_rescuer(struct workqueue_struct *wq)
 	}
 
 	wq->rescuer = rescuer;
-	kthread_bind_mask(rescuer->task, cpu_possible_mask);
 	wake_up_process(rescuer->task);
 
 	return 0;
-- 
2.19.1.6.gb485710b



* [RFC PATCH 4/8] workqueue: Set/Clear PF_WQ_WORKER while attaching/detaching
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
                   ` (2 preceding siblings ...)
  2022-08-04  8:41 ` [RFC PATCH 3/8] workqueue: Set PF_NO_SETAFFINITY instead of kthread_bind_mask() Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 5/8] workqueue: Use worker_set_flags() in worker_enter_idle() Lai Jiangshan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lai Jiangshan, Tejun Heo, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

PF_WQ_WORKER is only needed when the worker is attached.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 82937c0fb21f..7fc4c2fa21d6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1861,6 +1861,8 @@ static struct worker *alloc_worker(int node)
 static void worker_attach_to_pool(struct worker *worker,
 				   struct worker_pool *pool)
 {
+	WARN_ON_ONCE(worker->task != current);
+
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
@@ -1882,6 +1884,9 @@ static void worker_attach_to_pool(struct worker *worker,
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
 
+	/* tell the scheduler that this is a workqueue worker */
+	current->flags |= PF_WQ_WORKER;
+
 	mutex_unlock(&wq_pool_attach_mutex);
 }
 
@@ -1898,8 +1903,11 @@ static void worker_detach_from_pool(struct worker *worker)
 	struct worker_pool *pool = worker->pool;
 	struct completion *detach_completion = NULL;
 
+	WARN_ON_ONCE(worker->task != current);
+
 	mutex_lock(&wq_pool_attach_mutex);
 
+	current->flags &= ~PF_WQ_WORKER;
 	kthread_set_per_cpu(worker->task, -1);
 	list_del(&worker->node);
 	worker->pool = NULL;
@@ -2352,16 +2360,6 @@ static void process_scheduled_works(struct worker *worker)
 	}
 }
 
-static void set_pf_worker(bool val)
-{
-	mutex_lock(&wq_pool_attach_mutex);
-	if (val)
-		current->flags |= PF_WQ_WORKER;
-	else
-		current->flags &= ~PF_WQ_WORKER;
-	mutex_unlock(&wq_pool_attach_mutex);
-}
-
 /**
  * worker_thread - the worker thread function
  * @__worker: self
@@ -2384,9 +2382,6 @@ static int worker_thread(void *__worker)
 	/* attach the worker to the pool */
 	worker_attach_to_pool(worker, pool);
 
-	/* tell the scheduler that this is a workqueue worker */
-	set_pf_worker(true);
-
 	raw_spin_lock_irq(&pool->lock);
 	worker->pool->nr_workers++;
 	worker_enter_idle(worker);
@@ -2397,7 +2392,6 @@ static int worker_thread(void *__worker)
 	if (unlikely(worker->flags & WORKER_DIE)) {
 		raw_spin_unlock_irq(&pool->lock);
 		WARN_ON_ONCE(!list_empty(&worker->entry));
-		set_pf_worker(false);
 
 		set_task_comm(worker->task, "kworker/dying");
 		ida_free(&pool->worker_ida, worker->id);
@@ -2498,11 +2492,6 @@ static int rescuer_thread(void *__rescuer)
 	current->flags |= PF_NO_SETAFFINITY;
 	set_user_nice(current, RESCUER_NICE_LEVEL);
 
-	/*
-	 * Mark rescuer as worker too.  As WORKER_PREP is never cleared, it
-	 * doesn't participate in concurrency management.
-	 */
-	set_pf_worker(true);
 repeat:
 	set_current_state(TASK_IDLE);
 
@@ -2531,6 +2520,10 @@ static int rescuer_thread(void *__rescuer)
 
 		raw_spin_unlock_irq(&wq_mayday_lock);
 
+		/*
+		 * Attach the rescuer.  As WORKER_PREP is never cleared, it
+		 * doesn't participate in concurrency management.
+		 */
 		worker_attach_to_pool(rescuer, pool);
 
 		raw_spin_lock_irq(&pool->lock);
@@ -2600,7 +2593,6 @@ static int rescuer_thread(void *__rescuer)
 
 	if (should_stop) {
 		__set_current_state(TASK_RUNNING);
-		set_pf_worker(false);
 		return 0;
 	}
 
-- 
2.19.1.6.gb485710b



* [RFC PATCH 5/8] workqueue: Use worker_set_flags() in worker_enter_idle()
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
                   ` (3 preceding siblings ...)
  2022-08-04  8:41 ` [RFC PATCH 4/8] workqueue: Set/Clear PF_WQ_WORKER while attaching/detaching Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 6/8] workqueue: Simplify the starting of the newly created worker Lai Jiangshan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lai Jiangshan, Tejun Heo, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

worker_enter_idle() is only called in worker_thread() now.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7fc4c2fa21d6..afe62649fb3a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1799,8 +1799,7 @@ static void worker_enter_idle(struct worker *worker)
 			 (worker->hentry.next || worker->hentry.pprev)))
 		return;
 
-	/* can't use worker_set_flags(), also called from create_worker() */
-	worker->flags |= WORKER_IDLE;
+	worker_set_flags(worker, WORKER_IDLE);
 	pool->nr_idle++;
 	worker->last_active = jiffies;
 
-- 
2.19.1.6.gb485710b



* [RFC PATCH 6/8] workqueue: Simplify the starting of the newly created worker
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
                   ` (4 preceding siblings ...)
  2022-08-04  8:41 ` [RFC PATCH 5/8] workqueue: Use worker_set_flags() in worker_enter_idle() Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker() Lai Jiangshan
  2022-08-04  8:41 ` [RFC PATCH 8/8] workqueue: Move the locking out of maybe_create_worker() Lai Jiangshan
  7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lai Jiangshan, Tejun Heo, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

There is no point in the newly created worker entering and leaving
idle within the same pool->lock held region, nor in checking
WORKER_DIE, because it has not been added to the idle_list before the
lock is taken.

Remove this code from the startup of the newly created worker and move
it to the end of worker_thread().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index afe62649fb3a..64dc1833d11a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2383,31 +2383,16 @@ static int worker_thread(void *__worker)
 
 	raw_spin_lock_irq(&pool->lock);
 	worker->pool->nr_workers++;
-	worker_enter_idle(worker);
 	complete(&pool->created);
-woke_up:
-
-	/* am I supposed to die? */
-	if (unlikely(worker->flags & WORKER_DIE)) {
-		raw_spin_unlock_irq(&pool->lock);
-		WARN_ON_ONCE(!list_empty(&worker->entry));
-
-		set_task_comm(worker->task, "kworker/dying");
-		ida_free(&pool->worker_ida, worker->id);
-		worker_detach_from_pool(worker);
-		kfree(worker);
-		return 0;
-	}
 
-	worker_leave_idle(worker);
-recheck:
+loop:
 	/* no more worker necessary? */
 	if (!need_more_worker(pool))
 		goto sleep;
 
 	/* do we need to manage? */
 	if (unlikely(!may_start_working(pool)) && manage_workers(worker))
-		goto recheck;
+		goto loop;
 
 	/*
 	 * ->scheduled list can only be filled while a worker is
@@ -2457,7 +2442,21 @@ static int worker_thread(void *__worker)
 	raw_spin_unlock_irq(&pool->lock);
 	schedule();
 	raw_spin_lock_irq(&pool->lock);
-	goto woke_up;
+
+	/* am I supposed to die? */
+	if (unlikely(worker->flags & WORKER_DIE)) {
+		raw_spin_unlock_irq(&pool->lock);
+		WARN_ON_ONCE(!list_empty(&worker->entry));
+
+		set_task_comm(worker->task, "kworker/dying");
+		ida_free(&pool->worker_ida, worker->id);
+		worker_detach_from_pool(worker);
+		kfree(worker);
+		return 0;
+	}
+
+	worker_leave_idle(worker);
+	goto loop;
 }
 
 /**
-- 
2.19.1.6.gb485710b



* [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker()
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
                   ` (5 preceding siblings ...)
  2022-08-04  8:41 ` [RFC PATCH 6/8] workqueue: Simplify the starting of the newly created worker Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  2022-08-16 22:08   ` Tejun Heo
  2022-08-04  8:41 ` [RFC PATCH 8/8] workqueue: Move the locking out of maybe_create_worker() Lai Jiangshan
  7 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lai Jiangshan, Tejun Heo, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

worker_thread() always does the recheck after getting the manager role,
so the recheck in the maybe_create_worker() is unneeded and is removed.

A comment in maybe_create_worker() is removed because it is no longer
true after the recheck is removed.

A comment for manage_workers() is removed because another comment
already explains the same thing more accurately.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 64dc1833d11a..0d9844b81482 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2091,9 +2091,6 @@ static void pool_mayday_timeout(struct timer_list *t)
  * sent to all rescuers with works scheduled on @pool to resolve
  * possible allocation deadlock.
  *
- * On return, need_to_create_worker() is guaranteed to be %false and
- * may_start_working() %true.
- *
  * LOCKING:
  * raw_spin_lock_irq(pool->lock) which may be released and regrabbed
  * multiple times.  Does GFP_KERNEL allocations.  Called only from
@@ -2103,7 +2100,6 @@ static void maybe_create_worker(struct worker_pool *pool)
 __releases(&pool->lock)
 __acquires(&pool->lock)
 {
-restart:
 	raw_spin_unlock_irq(&pool->lock);
 
 	/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
@@ -2121,13 +2117,6 @@ __acquires(&pool->lock)
 
 	del_timer_sync(&pool->mayday_timer);
 	raw_spin_lock_irq(&pool->lock);
-	/*
-	 * This is necessary even after a new worker was just successfully
-	 * created as @pool->lock was dropped and the new worker might have
-	 * already become busy.
-	 */
-	if (need_to_create_worker(pool))
-		goto restart;
 }
 
 /**
@@ -2138,10 +2127,6 @@ __acquires(&pool->lock)
  * to.  At any given time, there can be only zero or one manager per
  * pool.  The exclusion is handled automatically by this function.
  *
- * The caller can safely start processing works on false return.  On
- * true return, it's guaranteed that need_to_create_worker() is false
- * and may_start_working() is true.
- *
  * CONTEXT:
  * raw_spin_lock_irq(pool->lock) which may be released and regrabbed
  * multiple times.  Does GFP_KERNEL allocations.
-- 
2.19.1.6.gb485710b



* [RFC PATCH 8/8] workqueue: Move the locking out of maybe_create_worker()
  2022-08-04  8:41 [RFC PATCH 0/8] workqueue: Fix for premature wakeups and cleanups Lai Jiangshan
                   ` (6 preceding siblings ...)
  2022-08-04  8:41 ` [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker() Lai Jiangshan
@ 2022-08-04  8:41 ` Lai Jiangshan
  7 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-04  8:41 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lai Jiangshan, Tejun Heo, Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

The code is cleaner if the reversed pair of unlock() and lock()
is moved into the caller.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 kernel/workqueue.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0d9844b81482..013ad61e67b9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2091,17 +2091,10 @@ static void pool_mayday_timeout(struct timer_list *t)
  * sent to all rescuers with works scheduled on @pool to resolve
  * possible allocation deadlock.
  *
- * LOCKING:
- * raw_spin_lock_irq(pool->lock) which may be released and regrabbed
- * multiple times.  Does GFP_KERNEL allocations.  Called only from
- * manager.
+ * Does GFP_KERNEL allocations.  Called only from manager.
  */
 static void maybe_create_worker(struct worker_pool *pool)
-__releases(&pool->lock)
-__acquires(&pool->lock)
 {
-	raw_spin_unlock_irq(&pool->lock);
-
 	/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
 	mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
 
@@ -2116,7 +2109,6 @@ __acquires(&pool->lock)
 	}
 
 	del_timer_sync(&pool->mayday_timer);
-	raw_spin_lock_irq(&pool->lock);
 }
 
 /**
@@ -2147,7 +2139,9 @@ static bool manage_workers(struct worker *worker)
 	pool->flags |= POOL_MANAGER_ACTIVE;
 	pool->manager = worker;
 
+	raw_spin_unlock_irq(&pool->lock);
 	maybe_create_worker(pool);
+	raw_spin_lock_irq(&pool->lock);
 
 	pool->manager = NULL;
 	pool->flags &= ~POOL_MANAGER_ACTIVE;
-- 
2.19.1.6.gb485710b



* Re: [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups
       [not found]   ` <20220804123520.1660-1-hdanton@sina.com>
@ 2022-08-05  2:30     ` Lai Jiangshan
  0 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-05  2:30 UTC (permalink / raw)
  To: Hillf Danton; +Cc: LKML, linux-mm, Petr Mladek, Peter Zijlstra


On Thu, Aug 4, 2022 at 8:35 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Thu,  4 Aug 2022 16:41:29 +0800 Lai Jiangshan wrote:
> >
> > @@ -1942,6 +1943,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> >               goto fail;
> >
> >       worker->id = id;
> > +     worker->pool = pool;
> >
> >       if (pool->cpu >= 0)
> >               snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> > @@ -1949,6 +1951,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> >       else
> >               snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
> >
> > +     reinit_completion(&pool->created);
> >       worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
> >                                             "kworker/%s", id_buf);
> >       if (IS_ERR(worker->task))
> > @@ -1957,15 +1960,9 @@ static struct worker *create_worker(struct worker_pool *pool)
> >       set_user_nice(worker->task, pool->attrs->nice);
> >       kthread_bind_mask(worker->task, pool->attrs->cpumask);
> >
> > -     /* successful, attach the worker to the pool */
> > -     worker_attach_to_pool(worker, pool);
> > -
> >       /* start the newly created worker */
> > -     raw_spin_lock_irq(&pool->lock);
> > -     worker->pool->nr_workers++;
> > -     worker_enter_idle(worker);
> >       wake_up_process(worker->task);
> > -     raw_spin_unlock_irq(&pool->lock);
> > +     wait_for_completion(&pool->created);
> >
> >       return worker;
>
>         cpu0    cpu1            cpu2
>         ===     ===             ===
>                 complete
>
>         reinit_completion
>                                 wait_for_completion

reinit_completion() and wait_for_completion() are both in
create_worker().  create_worker() itself is mutually exclusive,
which means no two create_worker()s can run at the same time
for the same pool.

No work item can be added before the first initial create_worker()
returns for a new or first-online per-cpu pool, so there would be no
manager for the pool during the first initial create_worker().

The manager is the only worker that can call create_worker() for a
pool, except for the first initial create_worker().

And there is always only one manager after the first initial
create_worker().

The documentation style in some of the workqueue code is:
"/* locking rule: what it is */"

For example:
struct list_head        worklist;       /* L: list of pending works */
which means it is protected by pool->lock.

And for
struct completion       created;        /* create_worker(): worker created */
it means it is protected by the exclusive create_worker().

>
> Any chance for race above?


* Re: [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool()
  2022-08-04  8:41 ` [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool() Lai Jiangshan
@ 2022-08-16 21:18   ` Tejun Heo
  2022-08-18 14:39     ` Lai Jiangshan
  2022-09-12  7:54   ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2022-08-16 21:18 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Linus Torvalds, Eric W. Biederman,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Valentin Schneider, Waiman Long

cc'ing Waiman.

On Thu, Aug 04, 2022 at 04:41:28PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> If a worker is spuriously woken up after kthread_bind_mask() but before
> worker_attach_to_pool(), and a CPU hot-[un]plug happens during the same
> interval, the worker task might be pushed away from its bound CPU by the
> scheduler with its affinity changed, and worker_attach_to_pool() does not
> rebind it properly.
> 
> Set the affinity unconditionally in worker_attach_to_pool() to fix the
> problem.
> 
> This also prepares for moving worker_attach_to_pool() from create_worker()
> to the start of worker_thread(), which will make the said interval exist
> even without a spurious wakeup.

So, this looks fine but I think the whole thing can be simplified if we
integrate this with the persistent user cpumask change that Waiman is
working on. We can just set the cpumask once during init and let the
scheduler core figure out what the current effective mask is as CPU
availability changes.

 http://lkml.kernel.org/r/20220816192734.67115-4-longman@redhat.com

Thanks.

-- 
tejun


* Re: [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups
  2022-08-04  8:41 ` [RFC PATCH 2/8] workqueue: Make create_worker() safe against premature wakeups Lai Jiangshan
       [not found]   ` <20220804123520.1660-1-hdanton@sina.com>
@ 2022-08-16 21:46   ` Tejun Heo
  1 sibling, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2022-08-16 21:46 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Linus Torvalds, Eric W. Biederman,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Waiman Long

On Thu, Aug 04, 2022 at 04:41:29PM +0800, Lai Jiangshan wrote:
...
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.

I'd rephrase the above. It's more that we can't assume that a sleeping task
will stay sleeping and should expect spurious wakeups.

> @@ -176,6 +176,7 @@ struct worker_pool {
>  						/* L: hash of busy workers */
>  
>  	struct worker		*manager;	/* L: purely informational */
> +	struct completion	created;	/* create_worker(): worker created */

Can we define sth like worker_create_args struct which contains a
completion, pointers to the worker and pool and use an on-stack instance to
carry the create parameters to the new worker thread? It's kinda odd to have
a persistent copy of completion in the pool

> @@ -1949,6 +1951,7 @@ static struct worker *create_worker(struct worker_pool *pool)
>  	else
>  		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
>  
> +	reinit_completion(&pool->created);

which keeps getting reinitialized.
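
A minimal sketch of that idea (all of the names below are made up for
illustration; this is not code from any posted patch):

  struct worker_create_args {
          struct completion       done;
          struct worker           *worker;
          struct worker_pool      *pool;
  };

  /* in create_worker(): pass an on-stack instance to the new thread */
  struct worker_create_args args = { .worker = worker, .pool = pool };

  init_completion(&args.done);
  worker->task = kthread_create_on_node(worker_thread, &args, pool->node,
                                        "kworker/%s", id_buf);
  ...
  wake_up_process(worker->task);
  /* args is on the stack, so don't return before the worker is done with it */
  wait_for_completion(&args.done);

  /* worker_thread() would then take worker/pool out of the on-stack args
   * and complete(&args->done) once it has copied what it needs. */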

> @@ -2383,10 +2380,17 @@ static int worker_thread(void *__worker)
>  	struct worker *worker = __worker;
>  	struct worker_pool *pool = worker->pool;
>  
> +	/* attach the worker to the pool */
> +	worker_attach_to_pool(worker, pool);

It's also odd for the new worker to have pool already set and then we attach
to that pool.

> @@ -37,8 +37,15 @@ struct worker {
>  	/* 64 bytes boundary on 64bit, 32 on 32bit */
>  
>  	struct task_struct	*task;		/* I: worker task */
> -	struct worker_pool	*pool;		/* A: the associated pool */
> -						/* L: for rescuers */
> +
> +	/*
> +	 * The associated pool, locking rules:
> +	 *   PF_WQ_WORKER: from the current worker
> +	 *   PF_WQ_WORKER && wq_pool_attach_mutex: from remote tasks
> +	 *   None: from the current worker when the worker is coming up
> +	 */
> +	struct worker_pool	*pool;

I have a difficult time understanding the above comment. Can you please
follow the same style as others?

I was hoping that this problem would be fixed through kthread changes, but
that doesn't seem to have happened yet, and given that we need to keep
modifying cpumasks dynamically anyway (e.g. for unbound pool config changes),
solving it from the wq side is fine too, especially if we can leverage the
same code paths that the dynamic changes are using.

That said, some of the complexities come from CPU hotplug messing with worker
cpumasks and wq trying to restore them, and it seems likely that all of this
will be simpler with the persistent cpumask that Waiman is working on.  Lai,
can you please take a look at that patchset?

Thanks.

-- 
tejun


* Re: [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker()
  2022-08-04  8:41 ` [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker() Lai Jiangshan
@ 2022-08-16 22:08   ` Tejun Heo
  2022-08-18 14:44     ` Lai Jiangshan
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2022-08-16 22:08 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: linux-kernel, Lai Jiangshan

On Thu, Aug 04, 2022 at 04:41:34PM +0800, Lai Jiangshan wrote:
> worker_thread() always does the recheck after getting the manager role,
> so the recheck in the maybe_create_worker() is unneeded and is removed.

So, before if multiple workers need to be created, a single manager would
create them all. After, we'd end up daisy chaining, right? One manager
creates one worker and goes to process one work item. The new worker wakes
up and becomes the manager and creates another worker and so on. That
doesn't seem like a desirable behavior.

Thanks.

-- 
tejun


* Re: [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool()
  2022-08-16 21:18   ` Tejun Heo
@ 2022-08-18 14:39     ` Lai Jiangshan
  0 siblings, 0 replies; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-18 14:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: LKML, Lai Jiangshan, Linus Torvalds, Eric W. Biederman,
	Petr Mladek, Michal Hocko, Peter Zijlstra, Wedson Almeida Filho,
	Valentin Schneider, Waiman Long

On Wed, Aug 17, 2022 at 5:18 AM Tejun Heo <tj@kernel.org> wrote:
>
> cc'ing Waiman.
>
> On Thu, Aug 04, 2022 at 04:41:28PM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > If a worker is spuriously woken up after kthread_bind_mask() but before
> > worker_attach_to_pool(), and a CPU hot-[un]plug happens during the same
> > interval, the worker task might be pushed away from its bound CPU by the
> > scheduler with its affinity changed, and worker_attach_to_pool() does not
> > rebind it properly.
> >
> > Set the affinity unconditionally in worker_attach_to_pool() to fix the
> > problem.
> >
> > This also prepares for moving worker_attach_to_pool() from create_worker()
> > to the start of worker_thread(), which will make the said interval exist
> > even without a spurious wakeup.
>
> So, this looks fine but I think the whole thing can be simplified if we
> integrate this with the persistent user cpumask change that Waiman is
> working on. We can just set the cpumask once during init and let the
> scheduler core figure out what the current effective mask is as CPU
> availability changes.
>
>  http://lkml.kernel.org/r/20220816192734.67115-4-longman@redhat.com
>

I like this approach.


* Re: [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker()
  2022-08-16 22:08   ` Tejun Heo
@ 2022-08-18 14:44     ` Lai Jiangshan
  2022-08-19 17:29       ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Lai Jiangshan @ 2022-08-18 14:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Lai Jiangshan

On Wed, Aug 17, 2022 at 6:08 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Aug 04, 2022 at 04:41:34PM +0800, Lai Jiangshan wrote:
> > worker_thread() always does the recheck after getting the manager role,
> > so the recheck in the maybe_create_worker() is unneeded and is removed.
>
> So, before if multiple workers need to be created, a single manager would
> create them all. After, we'd end up daisy chaining, right? One manager
> creates one worker and goes to process one work item. The new worker wakes
> up and becomes the manager and creates another worker and so on. That
> doesn't seem like a desirable behavior.
>

The recheck is always in the same pool lock critical section, so the
behavior isn't changed before/after this patch.
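
For reference, the relevant part of worker_thread() after this series
(simplified excerpt of the patched code, see patch 6/8 and 7/8):

  loop:
          /* pool->lock is held here */
          if (!need_more_worker(pool))
                  goto sleep;

          /*
           * manage_workers() returns with pool->lock re-acquired, so this
           * recheck runs in the same pool->lock critical section, just as
           * the old recheck inside maybe_create_worker() did.
           */
          if (unlikely(!may_start_working(pool)) && manage_workers(worker))
                  goto loop;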


* Re: [RFC PATCH 7/8] workqueue: Remove the outer loop in maybe_create_worker()
  2022-08-18 14:44     ` Lai Jiangshan
@ 2022-08-19 17:29       ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2022-08-19 17:29 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: LKML, Lai Jiangshan

On Thu, Aug 18, 2022 at 10:44:02PM +0800, Lai Jiangshan wrote:
> On Wed, Aug 17, 2022 at 6:08 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Thu, Aug 04, 2022 at 04:41:34PM +0800, Lai Jiangshan wrote:
> > > worker_thread() always does the recheck after getting the manager role,
> > > so the recheck in the maybe_create_worker() is unneeded and is removed.
> >
> > So, before if multiple workers need to be created, a single manager would
> > create them all. After, we'd end up daisy chaining, right? One manager
> > creates one worker and goes to process one work item. The new worker wakes
> > up and becomes the manager and creates another worker and so on. That
> > doesn't seem like a desirable behavior.
> >
> 
> The recheck is always in the same pool lock critical section, so the
> behavior isn't changed before/after this patch.

Ah, right you are.

Thanks.

-- 
tejun


* Re: [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool()
  2022-08-04  8:41 ` [RFC PATCH 1/8] workqueue: Unconditionally set cpumask in worker_attach_to_pool() Lai Jiangshan
  2022-08-16 21:18   ` Tejun Heo
@ 2022-09-12  7:54   ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2022-09-12  7:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Linus Torvalds, Eric W. Biederman,
	Tejun Heo, Petr Mladek, Michal Hocko, Wedson Almeida Filho,
	Valentin Schneider

On Thu, Aug 04, 2022 at 04:41:28PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> If a worker is spuriously woken up after kthread_bind_mask() but before
> worker_attach_to_pool(), and a CPU hot-[un]plug happens during the same
> interval, the worker task might be pushed away from its bound CPU by the
> scheduler with its affinity changed, and worker_attach_to_pool() does not
> rebind it properly.

Can you *please* be more explicit? The above doesn't give me enough of a
clue to reconstruct the actual scenario you're fixing.

Draw a picture or something.
