* [PATCH] block: Fix a race in the runtime power management code
@ 2020-08-24  3:06 Bart Van Assche
  2020-08-24 14:47 ` Alan Stern
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Bart Van Assche @ 2020-08-24  3:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Alan Stern,
	Stanley Chu, Ming Lei, stable, Can Guo

With the current implementation the following race can happen:
* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
  blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns
  true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does
  not call pm_request_resume() because the queue runtime status is
  RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-pm.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 
 	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
 
+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);
+
 	/*
 	 * Increase the pm_only counter before checking whether any
 	 * non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	/* Switch q_usage_counter back to per-cpu mode. */
 	blk_mq_unfreeze_queue(q);
 
-	spin_lock_irq(&q->queue_lock);
-	if (ret < 0)
+	if (ret < 0) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
-	else
-		q->rpm_status = RPM_SUSPENDING;
-	spin_unlock_irq(&q->queue_lock);
+		spin_unlock_irq(&q->queue_lock);
 
-	if (ret)
 		blk_clear_pm_only(q);
+	}
 
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-24  3:06 [PATCH] block: Fix a race in the runtime power management code Bart Van Assche
@ 2020-08-24 14:47 ` Alan Stern
  2020-08-25  9:01 ` Stanley Chu
  2020-08-25  9:11 ` Stanley Chu
  2 siblings, 0 replies; 16+ messages in thread
From: Alan Stern @ 2020-08-24 14:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Stanley Chu,
	Ming Lei, stable, Can Guo

On Sun, Aug 23, 2020 at 08:06:07PM -0700, Bart Van Assche wrote:
> With the current implementation the following race can happen:
> * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
>   blk_mq_unfreeze_queue().
> * blk_queue_enter() calls blk_queue_pm_only() and that function returns
>   true.
> * blk_queue_enter() calls blk_pm_request_resume() and that function does
>   not call pm_request_resume() because the queue runtime status is
>   RPM_ACTIVE.
> * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
> 
> Fix this race by changing the queue runtime status into RPM_SUSPENDING
> before switching q_usage_counter to atomic mode.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Stanley Chu <stanley.chu@mediatek.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: stable <stable@vger.kernel.org>
> Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
> Signed-off-by: Can Guo <cang@codeaurora.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-pm.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index b85234d758f7..17bd020268d4 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  
>  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>  
> +	spin_lock_irq(&q->queue_lock);
> +	q->rpm_status = RPM_SUSPENDING;
> +	spin_unlock_irq(&q->queue_lock);
> +
>  	/*
>  	 * Increase the pm_only counter before checking whether any
>  	 * non-PM blk_queue_enter() calls are in progress to avoid that any
> @@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	/* Switch q_usage_counter back to per-cpu mode. */
>  	blk_mq_unfreeze_queue(q);
>  
> -	spin_lock_irq(&q->queue_lock);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		spin_lock_irq(&q->queue_lock);
> +		q->rpm_status = RPM_ACTIVE;
>  		pm_runtime_mark_last_busy(q->dev);
> -	else
> -		q->rpm_status = RPM_SUSPENDING;
> -	spin_unlock_irq(&q->queue_lock);
> +		spin_unlock_irq(&q->queue_lock);
>  
> -	if (ret)
>  		blk_clear_pm_only(q);
> +	}
>  
>  	return ret;
>  }

Acked-by: Alan Stern <stern@rowland.harvard.edu>


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-24  3:06 [PATCH] block: Fix a race in the runtime power management code Bart Van Assche
  2020-08-24 14:47 ` Alan Stern
@ 2020-08-25  9:01 ` Stanley Chu
  2020-08-25  9:11 ` Stanley Chu
  2 siblings, 0 replies; 16+ messages in thread
From: Stanley Chu @ 2020-08-25  9:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Alan Stern, Ming Lei,
	stable, Can Guo

Hi Bart,

On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
> With the current implementation the following race can happen:
> * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
>   blk_mq_unfreeze_queue().
> * blk_queue_enter() calls blk_queue_pm_only() and that function returns
>   true.
> * blk_queue_enter() calls blk_pm_request_resume() and that function does
>   not call pm_request_resume() because the queue runtime status is
>   RPM_ACTIVE.
> * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
> 
> Fix this race by changing the queue runtime status into RPM_SUSPENDING
> before switching q_usage_counter to atomic mode.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Stanley Chu <stanley.chu@mediatek.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: stable <stable@vger.kernel.org>
> Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
> Signed-off-by: Can Guo <cang@codeaurora.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-pm.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index b85234d758f7..17bd020268d4 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  
>  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>  
> +	spin_lock_irq(&q->queue_lock);
> +	q->rpm_status = RPM_SUSPENDING;
> +	spin_unlock_irq(&q->queue_lock);
> +

Has the alternative below been considered, where RPM_SUSPENDING is set
after blk_mq_unfreeze_queue()?

	blk_freeze_queue_start(q);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);


Otherwise requests can enter the queue while rpm_status is RPM_SUSPENDING
during a small window, i.e., before blk_set_pm_only() is invoked. This
would make the meaning of rpm_status ambiguous.

In this way, the race would also be solved:

- Before blk_freeze_queue_start(), any request is still allowed to
enter the queue
- blk_freeze_queue_start() freezes the queue and blocks all upcoming
requests (making them wait on q->mq_freeze_wq)
- rpm_status is set to RPM_SUSPENDING
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then
blk_pm_request_resume() can be executed

Thanks,

Stanley Chu


>  	/*
>  	 * Increase the pm_only counter before checking whether any
>  	 * non-PM blk_queue_enter() calls are in progress to avoid that any
> @@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	/* Switch q_usage_counter back to per-cpu mode. */
>  	blk_mq_unfreeze_queue(q);
>  
> -	spin_lock_irq(&q->queue_lock);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		spin_lock_irq(&q->queue_lock);
> +		q->rpm_status = RPM_ACTIVE;
>  		pm_runtime_mark_last_busy(q->dev);
> -	else
> -		q->rpm_status = RPM_SUSPENDING;
> -	spin_unlock_irq(&q->queue_lock);
> +		spin_unlock_irq(&q->queue_lock);
>  
> -	if (ret)
>  		blk_clear_pm_only(q);
> +	}
>  
>  	return ret;
>  }






* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-24  3:06 [PATCH] block: Fix a race in the runtime power management code Bart Van Assche
  2020-08-24 14:47 ` Alan Stern
  2020-08-25  9:01 ` Stanley Chu
@ 2020-08-25  9:11 ` Stanley Chu
  2020-08-25 18:24   ` Alan Stern
  2020-08-26  2:58   ` Bart Van Assche
  2 siblings, 2 replies; 16+ messages in thread
From: Stanley Chu @ 2020-08-25  9:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Alan Stern, Ming Lei,
	stable, Can Guo

Sorry, resend to fix typo.

Hi Bart,

On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
> With the current implementation the following race can happen:
> * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
>   blk_mq_unfreeze_queue().
> * blk_queue_enter() calls blk_queue_pm_only() and that function returns
>   true.
> * blk_queue_enter() calls blk_pm_request_resume() and that function does
>   not call pm_request_resume() because the queue runtime status is
>   RPM_ACTIVE.
> * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
> 
> Fix this race by changing the queue runtime status into RPM_SUSPENDING
> before switching q_usage_counter to atomic mode.
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Stanley Chu <stanley.chu@mediatek.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: stable <stable@vger.kernel.org>
> Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
> Signed-off-by: Can Guo <cang@codeaurora.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-pm.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index b85234d758f7..17bd020268d4 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  
>  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>  
> +	spin_lock_irq(&q->queue_lock);
> +	q->rpm_status = RPM_SUSPENDING;
> +	spin_unlock_irq(&q->queue_lock);
> +

Has the alternative below been considered, where RPM_SUSPENDING is set
after blk_freeze_queue_start()?

	blk_freeze_queue_start(q);

+	spin_lock_irq(&q->queue_lock);
+	q->rpm_status = RPM_SUSPENDING;
+	spin_unlock_irq(&q->queue_lock);


Otherwise requests can enter the queue while rpm_status is RPM_SUSPENDING
during a small window, i.e., before blk_set_pm_only() is invoked. This
would make the meaning of rpm_status ambiguous.

In this way, the race would also be solved:

- Before blk_freeze_queue_start(), any request is still allowed to
enter the queue
- blk_freeze_queue_start() freezes the queue and blocks all upcoming
requests (making them wait on q->mq_freeze_wq)
- rpm_status is set to RPM_SUSPENDING
- blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then
blk_pm_request_resume() can be executed

Thanks,

Stanley Chu


>  	/*
>  	 * Increase the pm_only counter before checking whether any
>  	 * non-PM blk_queue_enter() calls are in progress to avoid that any
> @@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	/* Switch q_usage_counter back to per-cpu mode. */
>  	blk_mq_unfreeze_queue(q);
>  
> -	spin_lock_irq(&q->queue_lock);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		spin_lock_irq(&q->queue_lock);
> +		q->rpm_status = RPM_ACTIVE;
>  		pm_runtime_mark_last_busy(q->dev);
> -	else
> -		q->rpm_status = RPM_SUSPENDING;
> -	spin_unlock_irq(&q->queue_lock);
> +		spin_unlock_irq(&q->queue_lock);
>  
> -	if (ret)
>  		blk_clear_pm_only(q);
> +	}
>  
>  	return ret;
>  }







* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-25  9:11 ` Stanley Chu
@ 2020-08-25 18:24   ` Alan Stern
  2020-08-25 22:22     ` Bart Van Assche
  2020-08-26  2:58   ` Bart Van Assche
  1 sibling, 1 reply; 16+ messages in thread
From: Alan Stern @ 2020-08-25 18:24 UTC (permalink / raw)
  To: Stanley Chu
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo

On Tue, Aug 25, 2020 at 05:11:21PM +0800, Stanley Chu wrote:
> Sorry, resend to fix typo.
> 
> Hi Bart,
> 
> On Sun, 2020-08-23 at 20:06 -0700, Bart Van Assche wrote:
> > With the current implementation the following race can happen:
> > * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
> >   blk_mq_unfreeze_queue().
> > * blk_queue_enter() calls blk_queue_pm_only() and that function returns
> >   true.
> > * blk_queue_enter() calls blk_pm_request_resume() and that function does
> >   not call pm_request_resume() because the queue runtime status is
> >   RPM_ACTIVE.
> > * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
> > 
> > Fix this race by changing the queue runtime status into RPM_SUSPENDING
> > before switching q_usage_counter to atomic mode.
> > 
> > Cc: Alan Stern <stern@rowland.harvard.edu>
> > Cc: Stanley Chu <stanley.chu@mediatek.com>
> > Cc: Ming Lei <ming.lei@redhat.com>
> > Cc: stable <stable@vger.kernel.org>
> > Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
> > Signed-off-by: Can Guo <cang@codeaurora.org>
> > Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> > ---
> >  block/blk-pm.c | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/blk-pm.c b/block/blk-pm.c
> > index b85234d758f7..17bd020268d4 100644
> > --- a/block/blk-pm.c
> > +++ b/block/blk-pm.c
> > @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> >  
> >  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> >  
> > +	spin_lock_irq(&q->queue_lock);
> > +	q->rpm_status = RPM_SUSPENDING;
> > +	spin_unlock_irq(&q->queue_lock);
> > +
> 
> Has below alternative way been considered that RPM_SUSPENDING is set
> after blk_freeze_queue_start()?
> 
> 	blk_freeze_queue_start(q);
> 
> +	spin_lock_irq(&q->queue_lock);
> +	q->rpm_status = RPM_SUSPENDING;
> +	spin_unlock_irq(&q->queue_lock);
> 
> 
> Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING
> during a small window, i.e., before blk_set_pm_only() is invoked. This
> would make the definition of rpm_status ambiguous.
> 
> In this way, the racing could be also solved:
> 
> - Before blk_freeze_queue_start(), any requests shall be allowed to
> enter queue
> - blk_freeze_queue_start() freezes the queue and blocks all upcoming
> requests (make them wait_event(q->mq_freeze_wq))
> - rpm_status is set as RPM_SUSPENDING
> - blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then
> blk_pm_request_resume() can be executed

A very similar question arises concerning the transition from 
RPM_SUSPENDING to RPM_SUSPENDED.  I am not convinced that the existing 
synchronization is sufficient.  However, this may not matter because...

A related question concerns the BLK_MQ_REQ_PREEMPT flag.  If it is set 
then the request is allowed while rpm_status is RPM_SUSPENDING.  But in 
fact, the only requests which should be allowed at that time are those 
created by the lower-layer driver as part of its runtime-suspend 
handling; all other requests should be deferred.  The BLK_MQ_REQ_PREEMPT 
flag doesn't seem like the right way to achieve this.  Should we be 
using a different flag?

Alan Stern


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-25 18:24   ` Alan Stern
@ 2020-08-25 22:22     ` Bart Van Assche
  2020-08-26  1:51       ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-08-25 22:22 UTC (permalink / raw)
  To: Alan Stern, Stanley Chu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Ming Lei, stable, Can Guo

On 2020-08-25 11:24, Alan Stern wrote:
> A related question concerns the BLK_MQ_REQ_PREEMPT flag.  If it is set 
> then the request is allowed while rpm_status is RPM_SUSPENDING.  But in 
> fact, the only requests which should be allowed at that time are those 
> created by the lower-layer driver as part of its runtime-suspend 
> handling; all other requests should be deferred.  The BLK_MQ_REQ_PREEMPT 
> flag doesn't seem like the right way to achieve this.  Should we be 
> using a different flag?

I think there is already a flag that is used to mark power management
commands, namely RQF_PM.

My understanding is that the BLK_MQ_REQ_PREEMPT/RQF_PREEMPT flags are used
for the following purposes:
* In the SCSI core, scsi_execute() sets the BLK_MQ_REQ_PREEMPT flag. As a
  result, SCSI commands submitted with scsi_execute() are processed in
  the SDEV_QUIESCE SCSI device state. Since scsi_execute() is only used
  to submit commands other than medium access commands, in the SDEV_QUIESCE
  state medium access commands are paused while commands that do not access
  the storage medium (e.g. START/STOP commands) are processed.
* In the IDE-driver, for making one request preempt another request. From
  an old IDE commit (not sure if this is still relevant):
  + * If action is ide_preempt, then the rq is queued at the head of
  + * the request queue, displacing the currently-being-processed
  + * request and this function returns immediately without waiting
  + * for the new rq to be completed.  This is VERY DANGEROUS, and is
  + * intended for careful use by the ATAPI tape/cdrom driver code.

Bart.


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-25 22:22     ` Bart Van Assche
@ 2020-08-26  1:51       ` Alan Stern
  2020-08-27  3:35         ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2020-08-26  1:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo

On Tue, Aug 25, 2020 at 03:22:03PM -0700, Bart Van Assche wrote:
> On 2020-08-25 11:24, Alan Stern wrote:
> > A related question concerns the BLK_MQ_REQ_PREEMPT flag.  If it is set 
> > then the request is allowed while rpm_status is RPM_SUSPENDING.  But in 
> > fact, the only requests which should be allowed at that time are those 
> > created by the lower-layer driver as part of its runtime-suspend 
> > handling; all other requests should be deferred.  The BLK_MQ_REQ_PREEMPT 
> > flag doesn't seem like the right way to achieve this.  Should we be 
> > using a different flag?
> 
> I think there is already a flag that is used to mark power management
> commands, namely RQF_PM.
> 
> My understanding is that the BLK_MQ_REQ_PREEMPT/RQF_PREEMPT flags are used
> for the following purposes:
> * In the SCSI core, scsi_execute() sets the BLK_MQ_REQ_PREEMPT flag. As a
>   result, SCSI commands submitted with scsi_execute() are processed in
>   the SDEV_QUIESCE SCSI device state. Since scsi_execute() is only used
>   to submit commands other than medium access commands, in the SDEV_QUIESCE
>   state medium access commands are paused while commands that do not access
>   the storage medium (e.g. START/STOP commands) are processed.
> * In the IDE-driver, for making one request preempt another request. From
>   an old IDE commit (not sure if this is still relevant):
>   + * If action is ide_preempt, then the rq is queued at the head of
>   + * the request queue, displacing the currently-being-processed
>   + * request and this function returns immediately without waiting
>   + * for the new rq to be completed.  This is VERY DANGEROUS, and is
>   + * intended for careful use by the ATAPI tape/cdrom driver code.

Ah, perfect.  So in blk_queue_enter(), pm should be defined in terms of 
RQF_PM rather than BLK_MQ_REQ_PREEMPT.

The difficulty is that the flags argument is the wrong type; RQF_PM is 
defined as req_flags_t, not blk_mq_req_flags_t.  It is associated with a 
particular request after the request has been created, so after 
blk_queue_enter() has been called.

How can we solve this?

Alan Stern


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-25  9:11 ` Stanley Chu
  2020-08-25 18:24   ` Alan Stern
@ 2020-08-26  2:58   ` Bart Van Assche
  2020-08-26  4:00     ` Stanley Chu
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-08-26  2:58 UTC (permalink / raw)
  To: Stanley Chu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Alan Stern, Ming Lei,
	stable, Can Guo

On 2020-08-25 02:11, Stanley Chu wrote:
>> diff --git a/block/blk-pm.c b/block/blk-pm.c
>> index b85234d758f7..17bd020268d4 100644
>> --- a/block/blk-pm.c
>> +++ b/block/blk-pm.c
>> @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>>  
>>  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>>  
>> +	spin_lock_irq(&q->queue_lock);
>> +	q->rpm_status = RPM_SUSPENDING;
>> +	spin_unlock_irq(&q->queue_lock);
>> +
> 
> Has below alternative way been considered that RPM_SUSPENDING is set
> after blk_freeze_queue_start()?
> 
> 	blk_freeze_queue_start(q);
> 
> +	spin_lock_irq(&q->queue_lock);
> +	q->rpm_status = RPM_SUSPENDING;
> +	spin_unlock_irq(&q->queue_lock);
> 
> 
> Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING
> during a small window, i.e., before blk_set_pm_only() is invoked. This
> would make the definition of rpm_status ambiguous.
> 
> In this way, the racing could be also solved:
> 
> - Before blk_freeze_queue_start(), any requests shall be allowed to
> enter queue
> - blk_freeze_queue_start() freezes the queue and blocks all upcoming
> requests (make them wait_event(q->mq_freeze_wq))
> - rpm_status is set as RPM_SUSPENDING
> - blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then
> blk_pm_request_resume() can be executed

Hi Stanley,

I prefer the order from the patch. I think it is important to change
q->rpm_status into RPM_SUSPENDING before blk_queue_enter() calls
blk_queue_pm_only(). Otherwise it could happen that blk_queue_enter()
calls blk_pm_request_resume() while q->rpm_status == RPM_ACTIVE, resulting
in blk_queue_enter() not resuming a queue although that queue should be
resumed.

Thanks,

Bart.


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-26  2:58   ` Bart Van Assche
@ 2020-08-26  4:00     ` Stanley Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Stanley Chu @ 2020-08-26  4:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Alan Stern, Ming Lei,
	stable, Can Guo

Hi Bart,

On Tue, 2020-08-25 at 19:58 -0700, Bart Van Assche wrote:
> On 2020-08-25 02:11, Stanley Chu wrote:
> >> diff --git a/block/blk-pm.c b/block/blk-pm.c
> >> index b85234d758f7..17bd020268d4 100644
> >> --- a/block/blk-pm.c
> >> +++ b/block/blk-pm.c
> >> @@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> >>  
> >>  	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> >>  
> >> +	spin_lock_irq(&q->queue_lock);
> >> +	q->rpm_status = RPM_SUSPENDING;
> >> +	spin_unlock_irq(&q->queue_lock);
> >> +
> > 
> > Has below alternative way been considered that RPM_SUSPENDING is set
> > after blk_freeze_queue_start()?
> > 
> > 	blk_freeze_queue_start(q);
> > 
> > +	spin_lock_irq(&q->queue_lock);
> > +	q->rpm_status = RPM_SUSPENDING;
> > +	spin_unlock_irq(&q->queue_lock);
> > 
> > 
> > Otherwise requests can enter queue while rpm_status is RPM_SUSPENDING
> > during a small window, i.e., before blk_set_pm_only() is invoked. This
> > would make the definition of rpm_status ambiguous.
> > 
> > In this way, the racing could be also solved:
> > 
> > - Before blk_freeze_queue_start(), any requests shall be allowed to
> > enter queue
> > - blk_freeze_queue_start() freezes the queue and blocks all upcoming
> > requests (make them wait_event(q->mq_freeze_wq))
> > - rpm_status is set as RPM_SUSPENDING
> > - blk_mq_unfreeze_queue() wakes up q->mq_freeze_wq and then
> > blk_pm_request_resume() can be executed
> 
> Hi Stanley,
> 
> I prefer the order from the patch. I think it is important to change
> q->rpm_status into RPM_SUSPENDING before blk_queue_enter() calls
> blk_queue_pm_only(). Otherwise it could happen that blk_queue_enter()
> calls blk_pm_request_resume() while q->rpm_status == RPM_ACTIVE, resulting
> in blk_queue_enter() not resuming a queue although that queue should be
> resumed.

I see. Thanks for the explanation.

With the ordering from the patch, it is guaranteed that
blk_pm_request_resume() always triggers the runtime resume flow in
blk_queue_enter() once blk_set_pm_only() has been called by the
concurrent suspend flow, i.e., blk_pre_runtime_suspend().

However, rpm_status would be ambiguously RPM_SUSPENDING for a short
window, even though that does not obviously cause any issues.

Acked-by: Stanley Chu <stanley.chu@mediatek.com>




* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-26  1:51       ` Alan Stern
@ 2020-08-27  3:35         ` Bart Van Assche
  2020-08-27 20:33           ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-08-27  3:35 UTC (permalink / raw)
  To: Alan Stern
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen

On 2020-08-25 18:51, Alan Stern wrote:
> Ah, perfect.  So in blk_queue_enter(), pm should be defined in terms of 
> RQF_PM rather than BLK_MQ_REQ_PREEMPT.
> 
> The difficulty is that the flags argument is the wrong type; RQF_PM is 
> defined as req_flags_t, not blk_mq_req_flags_t.  It is associated with a 
> particular request after the request has been created, so after 
> blk_queue_enter() has been called.
> 
> How can we solve this?

The current code looks a bit weird because my focus when modifying the PM
code has been on not breaking any existing code.

scsi_device_quiesce() relies on blk_queue_enter() processing all PREEMPT
requests. A difficulty is that scsi_device_quiesce() is used for two
separate purposes:
* Runtime power management.
* SCSI domain validation. See e.g. https://lwn.net/Articles/75917/.

I think that modifying blk_queue_enter() such that it only accepts PM
requests will require splitting scsi_device_quiesce() into two functions:
one function that is used by the runtime power management code and another
function that is used by the SCSI domain validation code. This may require
introducing new SCSI device states. If new SCSI device states are
introduced, that should be done without modifying the state that is
reported to user space. See also sdev_states[] and show_state_field in
scsi_sysfs.c.

Thanks,

Bart.


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-27  3:35         ` Bart Van Assche
@ 2020-08-27 20:33           ` Alan Stern
  2020-08-28  3:27             ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2020-08-27 20:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen

On Wed, Aug 26, 2020 at 08:35:42PM -0700, Bart Van Assche wrote:
> On 2020-08-25 18:51, Alan Stern wrote:
> > Ah, perfect.  So in blk_queue_enter(), pm should be defined in terms of 
> > RQF_PM rather than BLK_MQ_REQ_PREEMPT.
> > 
> > The difficulty is that the flags argument is the wrong type; RQF_PM is 
> > defined as req_flags_t, not blk_mq_req_flags_t.  It is associated with a 
> > particular request after the request has been created, so after 
> > blk_queue_enter() has been called.
> > 
> > How can we solve this?
> 
> The current code looks a bit weird because my focus when modifying the PM
> code has been on not breaking any existing code.
> 
> scsi_device_quiesce() relies on blk_queue_enter() processing all PREEMPT
> requests. A difficulty is that scsi_device_quiesce() is used for two
> separate purposes:
> * Runtime power management.
> * SCSI domain validation. See e.g. https://lwn.net/Articles/75917/.
> 
> I think that modifying blk_queue_enter() such that it only accepts PM
> requests will require to split scsi_device_quiesce() into two functions:
> one function that is used by the runtime power management code and another
> function that is used by the SCSI domain validation code. This may require
> to introduce new SCSI device states. If new SCSI device states are
> introduced, that should be done without modifying the state that is
> reported to user space. See also sdev_states[] and show_state_field in
> scsi_sysfs.c.

It may not need to be that complicated.  What about something like this?

Alan Stern


Index: usb-devel/block/blk-core.c
===================================================================
--- usb-devel.orig/block/blk-core.c
+++ usb-devel/block/blk-core.c
@@ -420,11 +420,11 @@ EXPORT_SYMBOL(blk_cleanup_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
+	const bool pm = flags & BLK_MQ_REQ_PM;
 
 	while (true) {
 		bool success = false;
@@ -626,7 +626,8 @@ struct request *blk_get_request(struct r
 	struct request *req;
 
 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
+			BLK_MQ_REQ_PM));
 
 	req = blk_mq_alloc_request(q, op, flags);
 	if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
Index: usb-devel/drivers/scsi/scsi_lib.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_lib.c
+++ usb-devel/drivers/scsi/scsi_lib.c
@@ -245,11 +245,15 @@ int __scsi_execute(struct scsi_device *s
 {
 	struct request *req;
 	struct scsi_request *rq;
+	blk_mq_req_flags_t mq_req_flags;
 	int ret = DRIVER_ERROR << 24;
 
+	mq_req_flags = BLK_MQ_REQ_PREEMPT;
+	if (rq_flags & RQF_PM)
+		mq_req_flags |= BLK_MQ_REQ_PM;
 	req = blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, mq_req_flags);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
Index: usb-devel/include/linux/blk-mq.h
===================================================================
--- usb-devel.orig/include/linux/blk-mq.h
+++ usb-devel/include/linux/blk-mq.h
@@ -435,6 +435,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* used for power management */
+	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,



* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-27 20:33           ` Alan Stern
@ 2020-08-28  3:27             ` Bart Van Assche
  2020-08-28 15:37               ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-08-28  3:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen

On 2020-08-27 13:33, Alan Stern wrote:
> It may not need to be that complicated.  What about something like this?
> 
> Index: usb-devel/block/blk-core.c
> ===================================================================
> --- usb-devel.orig/block/blk-core.c
> +++ usb-devel/block/blk-core.c
> @@ -420,11 +420,11 @@ EXPORT_SYMBOL(blk_cleanup_queue);
>  /**
>   * blk_queue_enter() - try to increase q->q_usage_counter
>   * @q: request queue pointer
> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
>   */
>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  {
> -	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
> +	const bool pm = flags & BLK_MQ_REQ_PM;
>  
>  	while (true) {
>  		bool success = false;
> @@ -626,7 +626,8 @@ struct request *blk_get_request(struct r
>  	struct request *req;
>  
>  	WARN_ON_ONCE(op & REQ_NOWAIT);
> -	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
> +	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
> +			BLK_MQ_REQ_PM));
>  
>  	req = blk_mq_alloc_request(q, op, flags);
>  	if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
> Index: usb-devel/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-devel.orig/drivers/scsi/scsi_lib.c
> +++ usb-devel/drivers/scsi/scsi_lib.c
> @@ -245,11 +245,15 @@ int __scsi_execute(struct scsi_device *s
>  {
>  	struct request *req;
>  	struct scsi_request *rq;
> +	blk_mq_req_flags_t mq_req_flags;
>  	int ret = DRIVER_ERROR << 24;
>  
> +	mq_req_flags = BLK_MQ_REQ_PREEMPT;
> +	if (rq_flags & RQF_PM)
> +		mq_req_flags |= BLK_MQ_REQ_PM;
>  	req = blk_get_request(sdev->request_queue,
>  			data_direction == DMA_TO_DEVICE ?
> -			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
> +			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, mq_req_flags);
>  	if (IS_ERR(req))
>  		return ret;
>  	rq = scsi_req(req);
> Index: usb-devel/include/linux/blk-mq.h
> ===================================================================
> --- usb-devel.orig/include/linux/blk-mq.h
> +++ usb-devel/include/linux/blk-mq.h
> @@ -435,6 +435,8 @@ enum {
>  	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
>  	/* set RQF_PREEMPT */
>  	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
> +	/* used for power management */
> +	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
>  };
>  
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,

I think this patch will break SCSI domain validation. The SCSI domain
validation code calls scsi_device_quiesce() and that function in turn calls
blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with
the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests
while blk_set_pm_only() is in effect, I think the above patch will cause the
SCSI domain validation code to deadlock.
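
The deadlock described above can be sketched with a small user-space model
(hypothetical, not kernel code): once scsi_device_quiesce() has called
blk_set_pm_only(), a domain-validation command carrying only
BLK_MQ_REQ_PREEMPT would fail the proposed admission check and wait forever:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the admission check in blk_queue_enter() after
 * the proposed change: only requests tagged BLK_MQ_REQ_PM may enter
 * while blk_set_pm_only() is in effect.  The flag values mirror the
 * enum in include/linux/blk-mq.h. */

#define BLK_MQ_REQ_NOWAIT  (1u << 0)
#define BLK_MQ_REQ_PREEMPT (1u << 3)
#define BLK_MQ_REQ_PM      (1u << 4)

static bool may_enter(unsigned int flags, bool pm_only)
{
	/* Simplified from: success = pm || !blk_queue_pm_only(q),
	 * where pm = flags & BLK_MQ_REQ_PM after the patch. */
	bool pm = flags & BLK_MQ_REQ_PM;

	return pm || !pm_only;
}
```

With pm_only set, may_enter(BLK_MQ_REQ_PREEMPT, true) is false, so the DV
command would block in blk_queue_enter() while the quiesce it is part of
never completes.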

Bart.




* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-28  3:27             ` Bart Van Assche
@ 2020-08-28 15:37               ` Alan Stern
  2020-08-29  0:51                 ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2020-08-28 15:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen,
	SCSI development list

On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
> On 2020-08-27 13:33, Alan Stern wrote:
> > It may not need to be that complicated.  What about something like this?

> I think this patch will break SCSI domain validation. The SCSI domain
> validation code calls scsi_device_quiesce() and that function in turn calls
> blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with
> the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests
> while blk_set_pm_only() is in effect, I think the above patch will cause the
> SCSI domain validation code to deadlock.

Yes, you're right.

There may be an even simpler solution: Ensure that SCSI domain 
validation is mutually exclusive with runtime PM.  It's already mutually 
exclusive with system PM, so this makes sense.

What do you think of the patch below?

Alan Stern


Index: usb-devel/drivers/scsi/scsi_transport_spi.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
+++ usb-devel/drivers/scsi/scsi_transport_spi.c
@@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
 	 * Because this function and the power management code both call
 	 * scsi_device_quiesce(), it is not safe to perform domain validation
 	 * while suspend or resume is in progress. Hence the
-	 * lock/unlock_system_sleep() calls.
+	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
 	 */
 	lock_system_sleep();
 
@@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
 	if (unlikely(!buffer))
 		goto out_put;
 
+	if (scsi_autopm_get_device(sdev))
+		goto out_free;
+
 	/* We need to verify that the actual device will quiesce; the
 	 * later target quiesce is just a nice to have */
 	if (unlikely(scsi_device_quiesce(sdev)))
-		goto out_free;
+		goto out_autopm_put;
 
 	scsi_target_quiesce(starget);
 
@@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)
 
 	spi_initial_dv(starget) = 1;
 
+ out_autopm_put:
+	scsi_autopm_put_device(sdev);
  out_free:
 	kfree(buffer);
  out_put:




* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-28 15:37               ` Alan Stern
@ 2020-08-29  0:51                 ` Bart Van Assche
  2020-08-29  1:12                   ` Alan Stern
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Van Assche @ 2020-08-29  0:51 UTC (permalink / raw)
  To: Alan Stern
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen,
	SCSI development list

On 2020-08-28 08:37, Alan Stern wrote:
> On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
>> On 2020-08-27 13:33, Alan Stern wrote:
>>> It may not need to be that complicated.  What about something like this?
> 
>> I think this patch will break SCSI domain validation. The SCSI domain
>> validation code calls scsi_device_quiesce() and that function in turn calls
>> blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with
>> the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests
>> while blk_set_pm_only() is in effect, I think the above patch will cause the
>> SCSI domain validation code to deadlock.
> 
> Yes, you're right.
> 
> There may be an even simpler solution: Ensure that SCSI domain 
> validation is mutually exclusive with runtime PM.  It's already mutually 
> exclusive with system PM, so this makes sense.
> 
> What do you think of the patch below?
> 
> Alan Stern
> 
> 
> Index: usb-devel/drivers/scsi/scsi_transport_spi.c
> ===================================================================
> --- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
> +++ usb-devel/drivers/scsi/scsi_transport_spi.c
> @@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
>  	 * Because this function and the power management code both call
>  	 * scsi_device_quiesce(), it is not safe to perform domain validation
>  	 * while suspend or resume is in progress. Hence the
> -	 * lock/unlock_system_sleep() calls.
> +	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
>  	 */
>  	lock_system_sleep();
>  
> @@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
>  	if (unlikely(!buffer))
>  		goto out_put;
>  
> +	if (scsi_autopm_get_device(sdev))
> +		goto out_free;
> +
>  	/* We need to verify that the actual device will quiesce; the
>  	 * later target quiesce is just a nice to have */
>  	if (unlikely(scsi_device_quiesce(sdev)))
> -		goto out_free;
> +		goto out_autopm_put;
>  
>  	scsi_target_quiesce(starget);
>  
> @@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)
>  
>  	spi_initial_dv(starget) = 1;
>  
> + out_autopm_put:
> +	scsi_autopm_put_device(sdev);
>   out_free:
>  	kfree(buffer);
>   out_put:

Hi Alan,

I think this is only a part of the solution. scsi_target_quiesce() invokes
scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So
I think that the above is not sufficient to fix the deadlock mentioned in
my previous email.

Maybe it is possible to fix this by creating a new request queue and by
submitting the DV SCSI commands to that new request queue. There may be
better solutions.

Bart.


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-29  0:51                 ` Bart Van Assche
@ 2020-08-29  1:12                   ` Alan Stern
  2020-08-29  2:57                     ` Bart Van Assche
  0 siblings, 1 reply; 16+ messages in thread
From: Alan Stern @ 2020-08-29  1:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen,
	SCSI development list

On Fri, Aug 28, 2020 at 05:51:03PM -0700, Bart Van Assche wrote:
> On 2020-08-28 08:37, Alan Stern wrote:
> > On Thu, Aug 27, 2020 at 08:27:49PM -0700, Bart Van Assche wrote:
> >> On 2020-08-27 13:33, Alan Stern wrote:
> >>> It may not need to be that complicated.  What about something like this?
> > 
> >> I think this patch will break SCSI domain validation. The SCSI domain
> >> validation code calls scsi_device_quiesce() and that function in turn calls
> >> blk_set_pm_only(). The SCSI domain validation code submits SCSI commands with
> >> the BLK_MQ_REQ_PREEMPT flag. Since the above code postpones such requests
> >> while blk_set_pm_only() is in effect, I think the above patch will cause the
> >> SCSI domain validation code to deadlock.
> > 
> > Yes, you're right.
> > 
> > There may be an even simpler solution: Ensure that SCSI domain 
> > validation is mutually exclusive with runtime PM.  It's already mutually 
> > exclusive with system PM, so this makes sense.
> > 
> > What do you think of the patch below?
> > 
> > Alan Stern
> > 
> > 
> > Index: usb-devel/drivers/scsi/scsi_transport_spi.c
> > ===================================================================
> > --- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
> > +++ usb-devel/drivers/scsi/scsi_transport_spi.c
> > @@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
> >  	 * Because this function and the power management code both call
> >  	 * scsi_device_quiesce(), it is not safe to perform domain validation
> >  	 * while suspend or resume is in progress. Hence the
> > -	 * lock/unlock_system_sleep() calls.
> > +	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
> >  	 */
> >  	lock_system_sleep();
> >  
> > @@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
> >  	if (unlikely(!buffer))
> >  		goto out_put;
> >  
> > +	if (scsi_autopm_get_device(sdev))
> > +		goto out_free;
> > +
> >  	/* We need to verify that the actual device will quiesce; the
> >  	 * later target quiesce is just a nice to have */
> >  	if (unlikely(scsi_device_quiesce(sdev)))
> > -		goto out_free;
> > +		goto out_autopm_put;
> >  
> >  	scsi_target_quiesce(starget);
> >  
> > @@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)
> >  
> >  	spi_initial_dv(starget) = 1;
> >  
> > + out_autopm_put:
> > +	scsi_autopm_put_device(sdev);
> >   out_free:
> >  	kfree(buffer);
> >   out_put:
> 
> Hi Alan,
> 
> I think this is only a part of the solution. scsi_target_quiesce() invokes
> scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So
> I think that the above is not sufficient to fix the deadlock mentioned in
> my previous email.

Sorry, it sounds like you misinterpreted my preceding email.  I meant to 
suggest that the patch above should be considered _instead of_ the patch 
that introduced BLK_MQ_REQ_PM.  So blk_queue_enter() would remain 
unchanged, using the BLK_MQ_REQ_PREEMPT flag to decide whether or not to 
postpone new requests.  Thus the deadlock you're concerned about would 
not arise.
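
In other words (a hypothetical user-space sketch, not kernel code), the
unchanged admission check keeps gating on BLK_MQ_REQ_PREEMPT, so a
domain-validation command still enters the queue while pm_only is set:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the *unchanged* blk_queue_enter() check:
 * BLK_MQ_REQ_PREEMPT remains the flag that bypasses the pm-only gate,
 * so domain-validation commands are not postponed. */

#define BLK_MQ_REQ_PREEMPT (1u << 3)

static bool may_enter_unchanged(unsigned int flags, bool pm_only)
{
	/* Mirrors: const bool pm = flags & BLK_MQ_REQ_PREEMPT;
	 *          success = pm || !blk_queue_pm_only(q);       */
	bool pm = flags & BLK_MQ_REQ_PREEMPT;

	return pm || !pm_only;
}
```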

> Maybe it is possible to fix this by creating a new request queue and by
> submitting the DV SCSI commands to that new request queue. There may be
> better solutions.

I don't think that is necessary.  After all, quiescing is quiescing, 
whether it is done for runtime power management, domain validation, or 
anything else.  What we need to avoid is one thread trying to keep a 
device quiescent at the same time that another thread is re-activating 
it.

To some extent the SCSI core addresses this by allowing only one thread 
to put a device into the SDEV_QUIESCE state at a time.  Making domain 
validation mutually exclusive with runtime suspend is my attempt to 
prevent any odd corner-case interactions from cropping up.
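
The mutual exclusion the spi_dv_device() patch relies on can be modeled
roughly as follows (hypothetical user-space sketch, not kernel code):
scsi_autopm_get_device() takes a runtime-PM usage reference, so runtime
suspend is refused for as long as domain validation holds it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of runtime-PM usage counting.  The field names are made up;
 * the real counter lives in dev->power.usage_count. */
struct dev_model {
	int usage_count;
	bool suspended;
};

static int autopm_get(struct dev_model *d)
{
	d->usage_count++;
	d->suspended = false;	/* a suspended device is resumed first */
	return 0;		/* 0 on success, like scsi_autopm_get_device() */
}

static void autopm_put(struct dev_model *d)
{
	d->usage_count--;
}

static bool try_runtime_suspend(struct dev_model *d)
{
	if (d->usage_count > 0)
		return false;	/* DV holds a reference: suspend refused */
	d->suspended = true;
	return true;
}
```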

Alan Stern


* Re: [PATCH] block: Fix a race in the runtime power management code
  2020-08-29  1:12                   ` Alan Stern
@ 2020-08-29  2:57                     ` Bart Van Assche
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Van Assche @ 2020-08-29  2:57 UTC (permalink / raw)
  To: Alan Stern
  Cc: Stanley Chu, Jens Axboe, linux-block, Christoph Hellwig,
	Ming Lei, stable, Can Guo, Martin K. Petersen,
	SCSI development list

On 2020-08-28 18:12, Alan Stern wrote:
> On Fri, Aug 28, 2020 at 05:51:03PM -0700, Bart Van Assche wrote:
>> On 2020-08-28 08:37, Alan Stern wrote:
>>> There may be an even simpler solution: Ensure that SCSI domain 
>>> validation is mutually exclusive with runtime PM.  It's already mutually 
>>> exclusive with system PM, so this makes sense.
>>>
>>> What do you think of the patch below?
>>>
>>> Alan Stern
>>>
>>>
>>> Index: usb-devel/drivers/scsi/scsi_transport_spi.c
>>> ===================================================================
>>> --- usb-devel.orig/drivers/scsi/scsi_transport_spi.c
>>> +++ usb-devel/drivers/scsi/scsi_transport_spi.c
>>> @@ -1001,7 +1001,7 @@ spi_dv_device(struct scsi_device *sdev)
>>>  	 * Because this function and the power management code both call
>>>  	 * scsi_device_quiesce(), it is not safe to perform domain validation
>>>  	 * while suspend or resume is in progress. Hence the
>>> -	 * lock/unlock_system_sleep() calls.
>>> +	 * lock/unlock_system_sleep() and scsi_autopm_get/put_device() calls.
>>>  	 */
>>>  	lock_system_sleep();
>>>  
>>> @@ -1018,10 +1018,13 @@ spi_dv_device(struct scsi_device *sdev)
>>>  	if (unlikely(!buffer))
>>>  		goto out_put;
>>>  
>>> +	if (scsi_autopm_get_device(sdev))
>>> +		goto out_free;
>>> +
>>>  	/* We need to verify that the actual device will quiesce; the
>>>  	 * later target quiesce is just a nice to have */
>>>  	if (unlikely(scsi_device_quiesce(sdev)))
>>> -		goto out_free;
>>> +		goto out_autopm_put;
>>>  
>>>  	scsi_target_quiesce(starget);
>>>  
>>> @@ -1041,6 +1044,8 @@ spi_dv_device(struct scsi_device *sdev)
>>>  
>>>  	spi_initial_dv(starget) = 1;
>>>  
>>> + out_autopm_put:
>>> +	scsi_autopm_put_device(sdev);
>>>   out_free:
>>>  	kfree(buffer);
>>>   out_put:
>>
>> Hi Alan,
>>
>> I think this is only a part of the solution. scsi_target_quiesce() invokes
>> scsi_device_quiesce() and that function in turn calls blk_set_pm_only(). So
>> I think that the above is not sufficient to fix the deadlock mentioned in
>> my previous email.
> 
> Sorry, it sounds like you misinterpreted my preceding email.  I meant to 
> suggest that the patch above should be considered _instead of_ the patch 
> that introduced BLK_MQ_REQ_PM.  So blk_queue_enter() would remain 
> unchanged, using the BLK_MQ_REQ_PREEMPT flag to decide whether or not to 
> postpone new requests.  Thus the deadlock you're concerned about would 
> not arise.

Hi Alan,

Thanks for the clarification. I agree that the above patch serializes
SCSI DV and runtime PM, and I'm fine with it.

Thanks,

Bart.


end of thread, other threads:[~2020-08-29  2:57 UTC | newest]

Thread overview: 16+ messages
2020-08-24  3:06 [PATCH] block: Fix a race in the runtime power management code Bart Van Assche
2020-08-24 14:47 ` Alan Stern
2020-08-25  9:01 ` Stanley Chu
2020-08-25  9:11 ` Stanley Chu
2020-08-25 18:24   ` Alan Stern
2020-08-25 22:22     ` Bart Van Assche
2020-08-26  1:51       ` Alan Stern
2020-08-27  3:35         ` Bart Van Assche
2020-08-27 20:33           ` Alan Stern
2020-08-28  3:27             ` Bart Van Assche
2020-08-28 15:37               ` Alan Stern
2020-08-29  0:51                 ` Bart Van Assche
2020-08-29  1:12                   ` Alan Stern
2020-08-29  2:57                     ` Bart Van Assche
2020-08-26  2:58   ` Bart Van Assche
2020-08-26  4:00     ` Stanley Chu
