linux-kernel.vger.kernel.org archive mirror
* [PATCH] workqueue: Fix spurious sanity check failures in destroy_workqueue()
@ 2019-09-19  1:43 Tejun Heo
  2019-09-19  2:49 ` Lai Jiangshan
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2019-09-19  1:43 UTC (permalink / raw)
  To: Lai Jiangshan, ~
  Cc: linux-kernel, kernel-team, Marcin Pawlowski, Williams, Gerald S

Before actually destroying a workqueue, destroy_workqueue() checks
whether it's actually idle.  If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling.  It unfortunately
has a couple of issues.

* Mayday list queueing increments pwq refcnts, which get detected as
  busy and fail the sanity checks.  However, because mayday list
  queueing is asynchronous, this condition can happen without any
  actual work items left in the workqueue.

* A sanity check failure also leaves the sysfs interface behind, which
  can lead to init failure of newer instances of the workqueue.

This patch fixes the above two issues by:

* If a workqueue has a rescuer, disable and kill the rescuer before
  sanity checks.  Disabling and killing is guaranteed to flush the
  existing mayday list.

* Remove sysfs interface before sanity checks.
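The fixed ordering can be illustrated with a small userspace analogy
(this is not kernel code: pthreads stand in for kthread/spinlock
primitives, and run_teardown_demo() plus all variable names are
invented here purely for illustration).  Clearing the rescuer pointer
under the mayday lock stops new queueing, and joining the rescuer
thread guarantees the mayday list has been drained before the sanity
checks run:

```c
#include <assert.h>
#include <pthread.h>

/* Userspace sketch of the fixed teardown ordering (names invented). */
static pthread_mutex_t mayday_lock = PTHREAD_MUTEX_INITIALIZER;
static int maydays = 3;          /* pending mayday entries */
static void *rescuer = &maydays; /* stand-in for wq->rescuer */
static int stop = 0;             /* stand-in for kthread_should_stop() */

static void *rescuer_thread(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&mayday_lock);
		if (maydays > 0) {
			maydays--;	/* process one mayday entry */
			pthread_mutex_unlock(&mayday_lock);
			continue;
		}
		if (stop) {		/* only exit once the list is empty */
			pthread_mutex_unlock(&mayday_lock);
			return NULL;
		}
		pthread_mutex_unlock(&mayday_lock);
	}
}

int run_teardown_demo(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, rescuer_thread, NULL))
		return -1;

	/* destroy side: prevent new mayday queueing ... */
	pthread_mutex_lock(&mayday_lock);
	rescuer = NULL;
	pthread_mutex_unlock(&mayday_lock);

	/* ... then stop the rescuer; it drains the list before exiting */
	pthread_mutex_lock(&mayday_lock);
	stop = 1;
	pthread_mutex_unlock(&mayday_lock);
	pthread_join(t, NULL);	/* like kthread_stop() */

	/* the sanity checks now see an empty mayday list */
	return (maydays == 0 && rescuer == NULL) ? 0 : -1;
}
```

Because the rescuer only honors the stop request once the list is
empty, the join cannot complete while maydays remain, which mirrors why
kthread_stop() on the real rescuer flushes the existing mayday list.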

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
---
Applying to wq/for-5.4.

Thanks.

 kernel/workqueue.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 95aea04ff722..73e3ea9e31d3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4318,9 +4318,28 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	struct pool_workqueue *pwq;
 	int node;
 
+	/*
+	 * Remove it from sysfs first so that sanity check failure doesn't
+	 * lead to sysfs name conflicts.
+	 */
+	workqueue_sysfs_unregister(wq);
+
 	/* drain it before proceeding with destruction */
 	drain_workqueue(wq);
 
+	/* kill rescuer, if sanity checks fail, leave it w/o rescuer */
+	if (wq->rescuer) {
+		struct worker *rescuer = wq->rescuer;
+
+		/* this prevents new queueing */
+		spin_lock_irq(&wq_mayday_lock);
+		wq->rescuer = NULL;
+		spin_unlock_irq(&wq_mayday_lock);
+
+		/* rescuer will empty maydays list before exiting */
+		kthread_stop(rescuer->task);
+	}
+
 	/* sanity checks */
 	mutex_lock(&wq->mutex);
 	for_each_pwq(pwq, wq) {
@@ -4352,11 +4371,6 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del_rcu(&wq->list);
 	mutex_unlock(&wq_pool_mutex);
 
-	workqueue_sysfs_unregister(wq);
-
-	if (wq->rescuer)
-		kthread_stop(wq->rescuer->task);
-
 	if (!(wq->flags & WQ_UNBOUND)) {
 		wq_unregister_lockdep(wq);
 		/*
-- 
2.17.1



* Re: [PATCH] workqueue: Fix spurious sanity check failures in destroy_workqueue()
  2019-09-19  1:43 [PATCH] workqueue: Fix spurious sanity check failures in destroy_workqueue() Tejun Heo
@ 2019-09-19  2:49 ` Lai Jiangshan
  2019-09-20 21:08   ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Lai Jiangshan @ 2019-09-19  2:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: ~, LKML, kernel-team, Marcin Pawlowski, Williams, Gerald S

Looks good to me.

There is one test in show_pwq()
"""
 worker == pwq->wq->rescuer ? "(RESCUER)" : "",
"""
I'm wondering if it needs to be updated to
"""
worker->rescue_wq ? "(RESCUER)" : "",
"""

And documenting it as "/* MD: rescue worker */" might be better
than the current "/* I: rescue worker */", although ->rescuer can
be accessed without the wq_mayday_lock in some code.

Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>



On Thu, Sep 19, 2019 at 9:43 AM Tejun Heo <tj@kernel.org> wrote:
> [...]


* Re: [PATCH] workqueue: Fix spurious sanity check failures in destroy_workqueue()
  2019-09-19  2:49 ` Lai Jiangshan
@ 2019-09-20 21:08   ` Tejun Heo
  2019-09-20 21:12     ` [PATCH] workqueue: Minor follow-ups to the rescuer destruction change Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2019-09-20 21:08 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: ~, LKML, kernel-team, Marcin Pawlowski, Williams, Gerald S

Hello,

On Thu, Sep 19, 2019 at 10:49:04AM +0800, Lai Jiangshan wrote:
> Looks good to me.
> 
> There is one test in show_pwq()
> """
>  worker == pwq->wq->rescuer ? "(RESCUER)" : "",
> """
> I'm wondering if it needs to be updated to
> """
> worker->rescue_wq ? "(RESCUER)" : "",
> """

Hmm... yeah, good point.  Let's do that.

> And document "/* MD: rescue worker */" might be better
> than current "/* I: rescue worker */", although ->rescuer can
> be accessed without wq_mayday_lock lock in some code.

Will apply this one too.

Thanks.

-- 
tejun


* [PATCH] workqueue: Minor follow-ups to the rescuer destruction change
  2019-09-20 21:08   ` Tejun Heo
@ 2019-09-20 21:12     ` Tejun Heo
  0 siblings, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2019-09-20 21:12 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: LKML, kernel-team, Marcin Pawlowski, Williams, Gerald S

From 30ae2fc0a75eb5f1a38bbee7355d8e3bc823a6e1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 20 Sep 2019 14:09:14 -0700

* Now that wq->rescuer may be cleared while the rescuer is still
  there, switch the show_pwq() debug printout to test worker->rescue_wq
  to identify rescuers instead of testing wq->rescuer.

* Update comment on ->rescuer locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
---
Applied to workqueue/for-5.4-fixes.

Thanks.

 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3f067f1d72e3..7f7aa45dac37 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -248,7 +248,7 @@ struct workqueue_struct {
 	struct list_head	flusher_overflow; /* WQ: flush overflow list */
 
 	struct list_head	maydays;	/* MD: pwqs requesting rescue */
-	struct worker		*rescuer;	/* I: rescue worker */
+	struct worker		*rescuer;	/* MD: rescue worker */
 
 	int			nr_drainers;	/* WQ: drain in progress */
 	int			saved_max_active; /* WQ: saved pwq max_active */
@@ -4672,7 +4672,7 @@ static void show_pwq(struct pool_workqueue *pwq)
 
 			pr_cont("%s %d%s:%ps", comma ? "," : "",
 				task_pid_nr(worker->task),
-				worker == pwq->wq->rescuer ? "(RESCUER)" : "",
+				worker->rescue_wq ? "(RESCUER)" : "",
 				worker->current_func);
 			list_for_each_entry(work, &worker->scheduled, entry)
 				pr_cont_work(false, work);
-- 
2.17.1


