Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()

From: Lai Jiangshan <jiangshanlai@gmail.com>
To: Valentin Schneider <vschneid@redhat.com>
Cc: LKML <linux-kernel@vger.kernel.org>, Tejun Heo <tj@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Juri Lelli <juri.lelli@redhat.com>, Phil Auld <pauld@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [RFC PATCH v2 1/2] workqueue: Unbind workers before sending them to exit()
Date: Thu, 28 Jul 2022 01:13:37 +0800	[thread overview]
Message-ID: <CAJhGHyCeraX1jcea9kt_FBC561zBgECuw5qx8TAdCG0EHnT6kA@mail.gmail.com> (raw)
In-Reply-To: <20220727115327.2273547-2-vschneid@redhat.com>

Quick review before going to sleep.

On Wed, Jul 27, 2022 at 7:54 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> It has been reported that isolated CPUs can suffer from interference due to
> per-CPU kworkers waking up just to die.
>
> A surge of workqueue activity during initial setup of a latency-sensitive
> application (refresh_vm_stats() being one of the culprits) can cause extra
> per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
> running merrily on an isolated CPU only to be interrupted sometime later by
> a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
> kworker activity).
>
> Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
> contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
> them with WORKER_DIE.
>
> Changing the affinity does require a sleepable context, so get rid of the
> pool->idle_timer and use a delayed_work instead.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/workqueue.c | 109 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 81 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..27642166dcc5 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -167,9 +167,9 @@ struct worker_pool {
>         int                     nr_workers;     /* L: total number of workers */
>         int                     nr_idle;        /* L: currently idle workers */
>
> -       struct list_head        idle_list;      /* L: list of idle workers */
> -       struct timer_list       idle_timer;     /* L: worker idle timeout */
> -       struct timer_list       mayday_timer;   /* L: SOS timer for workers */
> +       struct list_head        idle_list;        /* L: list of idle workers */
> +       struct delayed_work     idle_reaper_work; /* L: worker idle timeout */
> +       struct timer_list       mayday_timer;     /* L: SOS timer for workers */
>
>         /* a workers is either on busy_hash or idle_list, or the manager */
>         DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
> @@ -1806,8 +1806,10 @@ static void worker_enter_idle(struct worker *worker)
>         /* idle_list is LIFO */
>         list_add(&worker->entry, &pool->idle_list);
>
> -       if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
> -               mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
> +       if (too_many_workers(pool) && !delayed_work_pending(&pool->idle_reaper_work))
> +               mod_delayed_work(system_unbound_wq,
> +                                &pool->idle_reaper_work,
> +                                IDLE_WORKER_TIMEOUT);

system_unbound_wq doesn't have a rescuer.

A new workqueue with a rescuer needs to be created and used for
this purpose.

>
>         /* Sanity check nr_running. */
>         WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
> @@ -1972,9 +1974,29 @@ static struct worker *create_worker(struct worker_pool *pool)
>         return NULL;
>  }
>
> +static void unbind_worker(struct worker *worker)
> +{
> +       kthread_set_per_cpu(worker->task, -1);
> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> +}
> +
> +static void rebind_worker(struct worker *worker, struct worker_pool *pool)
> +{
> +       kthread_set_per_cpu(worker->task, pool->cpu);
> +       WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
> +}
> +
> +static void reap_worker(struct worker *worker)
> +{
> +       list_del_init(&worker->entry);
> +       unbind_worker(worker);
> +       wake_up_process(worker->task);

Since WORKER_DIE is set, the worker can be possible freed now
if there is another source to wake it up.

I think reverting a part of the commit 60f5a4bcf852("workqueue:
async worker destruction") to make use of kthread_stop()
in destroy_worker() should be a good idea.