* [PATCH] workqueue: defer to waiting stop_machine
@ 2013-08-28 21:07 Jamie Liu
  2013-08-28 21:13 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Jamie Liu @ 2013-08-28 21:07 UTC (permalink / raw)
  To: Tejun Heo, Thomas Gleixner; +Cc: linux-kernel, Jamie Liu

A kworker running while stop_machine is waiting for CPUs can cause a
lockup if a work item on the workqueue continuously reschedules itself
because it cannot make progress while stop_machine is pending. Fix this
by having workers stop processing work items while a stop_machine is
pending.

Details:

Some ATA commands require that no other commands run on the same ATA
port while they execute. To support this, ATA ports allow a constituent
link to temporarily request exclusive access to the port by setting
ata_port.excl_link. If a link requests exclusive access, but must
temporarily delay the command, exclusive access remains held (to avoid
starving the command), preventing commands from other links on the same
port from proceeding.

Consider the following sequence of events:

1. An ATA command sent by the SCSI subsystem is dispatched to a
controller that requires excl_link for that command (as decided by e.g.
sil24_qc_defer() in drivers/ata/sata_sil24.c). The controller driver
sets excl_link and returns ATA_DEFER_PORT, causing scsi_dispatch_cmd()
to schedule a scsi_requeue_run_queue() work to retry the command.

2. A different command from the same ATA link is issued successfully.

3. The scsi_requeue_run_queue() work executes. scsi_request_fn() does
not dispatch the delayed command because the SCSI host is still busy
serving the command issued in (2), and instead calls blk_delay_queue()
to reschedule the command to be retried after SCSI_QUEUE_DELAY.

4. A command from a different ATA link on the same port is delayed
because the command from (1) set excl_link, causing scsi_dispatch_cmd()
to schedule a scsi_requeue_run_queue() work to retry the command.

5. The CPU whose gcwq contains the work from (4) switches to
worker_thread.

6. stop_machine() is called. All CPUs, except for the CPU running the
kworker in (5), switch to cpu_stopper_thread and enter
stop_machine_cpu_stop(), waiting for the CPU in (5) to switch as well.

7. The delay from (3) expires, causing the scsi_requeue_run_queue()
work (which will retry the original delayed command from (1)) to be
placed on the gcwq of a CPU currently in cpu_stopper_thread.

8. The kworker on the CPU in (5) executes the scsi_requeue_run_queue()
work from (4). As in (4), the retried command is delayed due to
excl_link and causes another scsi_requeue_run_queue() work to be
rescheduled on the same CPU.

The machine is now locked up; one CPU is stuck in worker_thread, and all
others are stuck in cpu_stopper_thread. This is difficult to fix from
the ATA side, since the need for excl_link is legitimate, and difficult
to fix from the SCSI side, since the need to retry the request is also
legitimate (and performance-sensitive).

Signed-off-by: Jamie Liu <jamieliu@google.com>
---
 include/linux/stop_machine.h | 13 +++++++++++++
 kernel/stop_machine.c        | 16 ++++++++++++++++
 kernel/workqueue.c           |  4 +++-
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..a315f92 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -127,6 +127,14 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);
 int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 				   const struct cpumask *cpus);
 
+/**
+ * stop_machine_pending: check if a stop_machine is waiting on this CPU
+ *
+ * CONTEXT:
+ * Preemption must be disabled.
+ */
+bool stop_machine_pending(void);
+
 #else	 /* CONFIG_STOP_MACHINE && CONFIG_SMP */
 
 static inline int __stop_machine(int (*fn)(void *), void *data,
@@ -152,5 +160,10 @@ static inline int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 	return __stop_machine(fn, data, cpus);
 }
 
+static inline bool stop_machine_pending(void)
+{
+	return false;
+}
+
 #endif	/* CONFIG_STOP_MACHINE && CONFIG_SMP */
 #endif	/* _LINUX_STOP_MACHINE */
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..4b99914 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -541,4 +541,20 @@ int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
 	return ret ?: done.ret;
 }
 
+bool stop_machine_pending(void)
+{
+	struct cpu_stopper *stopper;
+	unsigned long flags;
+	bool has_pending;
+
+	if (unlikely(!stop_machine_initialized))
+		return false;
+
+	stopper = &per_cpu(cpu_stopper, smp_processor_id());
+	spin_lock_irqsave(&stopper->lock, flags);
+	has_pending = !list_empty(&stopper->works);
+	spin_unlock_irqrestore(&stopper->lock, flags);
+	return has_pending;
+}
+
 #endif	/* CONFIG_STOP_MACHINE */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7f5d4be..cd23fac 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -47,6 +47,7 @@
 #include <linux/nodemask.h>
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
+#include <linux/stop_machine.h>
 
 #include "workqueue_internal.h"
 
@@ -734,7 +735,8 @@ static bool may_start_working(struct worker_pool *pool)
 static bool keep_working(struct worker_pool *pool)
 {
 	return !list_empty(&pool->worklist) &&
-		atomic_read(&pool->nr_running) <= 1;
+		atomic_read(&pool->nr_running) <= 1 &&
+		likely(!stop_machine_pending());
 }
 
 /* Do we need a new worker?  Called from manager. */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] workqueue: defer to waiting stop_machine
  2013-08-28 21:07 [PATCH] workqueue: defer to waiting stop_machine Jamie Liu
@ 2013-08-28 21:13 ` Tejun Heo
  2013-08-28 21:33   ` [PATCH] workqueue: cond_resched() after processing each work item Tejun Heo
  2013-08-29 20:45   ` [PATCH] workqueue: defer to waiting stop_machine Andreas Mohr
  0 siblings, 2 replies; 6+ messages in thread
From: Tejun Heo @ 2013-08-28 21:13 UTC (permalink / raw)
  To: Jamie Liu; +Cc: Thomas Gleixner, linux-kernel

On Wed, Aug 28, 2013 at 02:07:12PM -0700, Jamie Liu wrote:
...
> @@ -734,7 +735,8 @@ static bool may_start_working(struct worker_pool *pool)
>  static bool keep_working(struct worker_pool *pool)
>  {
>  	return !list_empty(&pool->worklist) &&
> -		atomic_read(&pool->nr_running) <= 1;
> +		atomic_read(&pool->nr_running) <= 1 &&
> +		likely(!stop_machine_pending());

Isn't the problem that the kworker wouldn't yield to the higher
priority stopper task while a work item keeps requeueing itself if
preemption is not enabled?  If so, isn't the correct solution just
adding cond_resched() in the work item processing loop?  The analysis
and solution seem to have gone a bit stray....

Thanks.

-- 
tejun


* [PATCH] workqueue: cond_resched() after processing each work item
  2013-08-28 21:13 ` Tejun Heo
@ 2013-08-28 21:33   ` Tejun Heo
  2013-08-28 23:11     ` Jamie Liu
  2013-08-29 20:45   ` [PATCH] workqueue: defer to waiting stop_machine Andreas Mohr
  1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2013-08-28 21:33 UTC (permalink / raw)
  To: Jamie Liu; +Cc: Thomas Gleixner, linux-kernel

Would something like the following work?  Can you please verify it?

Thanks.
---- 8< ----
If !PREEMPT, a kworker running work items back to back can hog the CPU.
This becomes dangerous when a self-requeueing work item which is
waiting for something to happen races against stop_machine.  Such a
work item would requeue itself indefinitely, hogging the kworker and
the CPU it's running on, while stop_machine waits for that CPU to enter
stop_machine and prevents anything else from happening on all other
CPUs.  The two would deadlock.

Jmamie Liu reports that this deadlock scenario exists around
scsi_requeue_run_queue() and libata port multiplier support, where one
port may exclude command processing from other ports.  With the right
timing, scsi_requeue_run_queue() can end up requeueing itself trying
to execute an IO which is asked to be retried while another device
holds exclusive access, and which in turn can't make forward progress
due to stop_machine.

Fix it by invoking cond_resched() after executing each work item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jamie Liu <jamieliu@google.com>
References: http://thread.gmane.org/gmane.linux.kernel/1552567
Cc: stable@vger.kernel.org
--
 kernel/workqueue.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f02c4a4..73b662b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2201,6 +2201,15 @@ __acquires(&pool->lock)
 		dump_stack();
 	}
 
+	/*
+	 * The following prevents a kworker from hogging CPU on !PREEMPT
+	 * kernels, where a requeueing work item waiting for something to
+	 * happen could deadlock with stop_machine as such work item could
+	 * indefinitely requeue itself while all other CPUs are trapped in
+	 * stop_machine.
+	 */
+	cond_resched();
+
 	spin_lock_irq(&pool->lock);
 
 	/* clear cpu intensive status */


* Re: [PATCH] workqueue: cond_resched() after processing each work item
  2013-08-28 21:33   ` [PATCH] workqueue: cond_resched() after processing each work item Tejun Heo
@ 2013-08-28 23:11     ` Jamie Liu
  0 siblings, 0 replies; 6+ messages in thread
From: Jamie Liu @ 2013-08-28 23:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Thomas Gleixner, linux-kernel

Hi Tejun,

On Wed, Aug 28, 2013 at 2:33 PM, Tejun Heo <tj@kernel.org> wrote:
> Would something like the following work?  Can you please verify it?

I confirm that this works.

> ...
> Jmamie Liu reports that this deadlock scenario exists around

s/Jmamie/Jamie/

> scsi_requeue_run_queue() and libata port multiplier support, where one
> port may exclude command processing from other ports.  With the right
> timing, scsi_requeue_run_queue() can end up requeueing itself trying
> to execute an IO which is asked to be retried while another device has
> an exclusive access, which in turn can't make forward progress due to
> stop_machine.
>
> Fix it by invoking cond_resched() after executing each work item.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jamie Liu <jamieliu@google.com>
> References: http://thread.gmane.org/gmane.linux.kernel/1552567
> Cc: stable@vger.kernel.org
> --
>  kernel/workqueue.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f02c4a4..73b662b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2201,6 +2201,15 @@ __acquires(&pool->lock)
>                 dump_stack();
>         }
>
> +       /*
> +        * The following prevents a kworker from hogging CPU on !PREEMPT
> +        * kernels, where a requeueing work item waiting for something to
> +        * happen could deadlock with stop_machine as such work item could
> +        * indefinitely requeue itself while all other CPUs are trapped in
> +        * stop_machine.
> +        */
> +       cond_resched();
> +
>         spin_lock_irq(&pool->lock);
>
>         /* clear cpu intensive status */

Thanks,
Jamie


* Re: [PATCH] workqueue: defer to waiting stop_machine
  2013-08-28 21:13 ` Tejun Heo
  2013-08-28 21:33   ` [PATCH] workqueue: cond_resched() after processing each work item Tejun Heo
@ 2013-08-29 20:45   ` Andreas Mohr
  2013-08-29 20:52     ` Jamie Liu
  1 sibling, 1 reply; 6+ messages in thread
From: Andreas Mohr @ 2013-08-29 20:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jamie Liu, Thomas Gleixner, linux-kernel

> Isn't the problem that the kworker wouldn't yield to the higher
> priority stopper task while a work item keeps requeueing itself if
> preemption is not enabled?  If so, isn't the correct solution just
> adding cond_resched() in the work item processing loop?  The analysis
> and solution seem to have gone a bit stray....

While I did not quite follow the very fine and detailed analysis,
I had the same feeling about it.

The previous solution seemed less preferable, for two reasons from a
modularity/dependency POV:
- it required a very specific (code smell?) dependency on stop_machine
  handling in workqueue code (machine stop handling is arguably a
  corner case, and supposed to remain just that!)
- the new stop_machine_pending() helper is pretty bloated, and called
  in a semi-hotpath to boot (since it's combined with && operators
  rather than ||, it seems like it would be called pretty much every
  time)

Preemption checks are expected to be much more general and widespread,
and thus seem like a much better fit.


Or, to put it another way: could it be that this very specific
stop_machine check was only added because, due to missing preemption
checks, we were busy-looping there and never getting back to the
standard code paths where the usual, hotpath stop_machine checks would
have been made? If so, perhaps there are other wasteful stop_machine
check sites in the kernel where a much cheaper reschedule would
suffice, getting us back to one central (and thus cache-hot) code site
with the stop_machine check.


Afraid of having stated the glaringly obvious ;),

Andreas Mohr


* Re: [PATCH] workqueue: defer to waiting stop_machine
  2013-08-29 20:45   ` [PATCH] workqueue: defer to waiting stop_machine Andreas Mohr
@ 2013-08-29 20:52     ` Jamie Liu
  0 siblings, 0 replies; 6+ messages in thread
From: Jamie Liu @ 2013-08-29 20:52 UTC (permalink / raw)
  To: Andreas Mohr; +Cc: Tejun Heo, Thomas Gleixner, linux-kernel

Hi Andreas,

Just calling cond_resched() does appear to be the more general
solution, and is already on tj/wq/for-next as
b22ce2785d97423846206cceec4efee0c4afd980 "workqueue: cond_resched()
after processing each work item".

Thanks,
Jamie

On Thu, Aug 29, 2013 at 1:45 PM, Andreas Mohr <andi@lisas.de> wrote:
> ...

