* [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
@ 2018-01-25  8:10 ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-01-25  8:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, Xiao Liang, Ming Lei, jianchao.wang, Sagi Grimberg,
	Keith Busch, stable

If a request is marked as NVME_REQ_CANCELLED, there is no need to retry it
by requeuing; it should be completed immediately. The flag name alone
suggests it should not be requeued.

Otherwise, an IO hang is easy to trigger when an IO times out on PCI
NVMe:

1) an IO timeout is triggered, and nvme_timeout() tries to disable the
device (nvme_dev_disable) and reset the controller (nvme_reset_ctrl)

2) inside nvme_dev_disable(), the queue is frozen and quiesced, and every
request is cancelled; however, the timed-out request cannot be cancelled,
since it has already been completed by __blk_mq_complete_request() in
blk_mq_rq_timed_out()

3) this timed-out request is requeued via nvme_complete_rq(), but it can't
be dispatched at all because the queue is quiesced and the hardware isn't
ready; finally nvme_wait_freeze() waits forever in nvme_reset_work()

Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: stable@vger.kernel.org
Reported-by: Xiao Liang <xiliang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0ff03cf95f7f..5cd713a164cb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -210,6 +210,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
 		return false;
 	if (nvme_req(req)->retries >= nvme_max_retries)
 		return false;
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		return false;
 	return true;
 }
 
-- 
2.9.5


* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-25  8:10 ` Ming Lei
@ 2018-01-25  8:52   ` jianchao.wang
  -1 siblings, 0 replies; 16+ messages in thread
From: jianchao.wang @ 2018-01-25  8:52 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: Sagi Grimberg, Xiao Liang, linux-nvme, Keith Busch, stable

Hi Ming

Thanks for your patch and detailed comment.

On 01/25/2018 04:10 PM, Ming Lei wrote:
> If request is marked as NVME_REQ_CANCELLED, we don't need to retry for
> requeuing it, and it should be completed immediately. Even simply from
> the flag name, it needn't to be requeued.
> 
> Otherwise, it is easy to cause IO hang when IO is timed out in case of
> PCI NVMe:
> 
> 1) IO timeout is triggered, and nvme_timeout() tries to disable
> device(nvme_dev_disable) and reset controller(nvme_reset_ctrl)
> 
> 2) inside nvme_dev_disable(), queue is frozen and quiesced, and
> try to cancel every request, but the timeout request can't be canceled
> since it is completed by __blk_mq_complete_request() in blk_mq_rq_timed_out().
> 
> 3) this timeout req is requeued via nvme_complete_rq(), but can't be
> dispatched at all because queue is quiesced and hardware isn't ready,
> finally nvme_wait_freeze() waits for ever in nvme_reset_work().

The queue will be unquiesced by nvme_start_queues() before we start to
wait for the freeze.

nvme_reset_work
....
		nvme_start_queues(&dev->ctrl);
		nvme_wait_freeze(&dev->ctrl);
		/* hit this only when tagset allocation fails */
		if (nvme_dev_add(dev))
			new_state = NVME_CTRL_ADMIN_ONLY;
		nvme_unfreeze(&dev->ctrl);
....
And the relationship between nvme_timeout and nvme_dev_disable is really
complicated: nvme_timeout may run in parallel with nvme_dev_disable and may
also invoke it. nvme_dev_disable may sleep while holding shutdown_lock, and
nvme_timeout will try to take that lock when it invokes nvme_dev_disable....

On the other hand, can we provide a helper interface to quiesce the
request_queue's timeout_work?
Such as following:

diff --git a/block/blk-core.c b/block/blk-core.c
index b888175..60469cd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -616,6 +616,44 @@ void blk_queue_bypass_end(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
 
+/*
+ * When this returns, the timeout_work will not be invoked any more.
+ * The caller must take charge of the outstanding IOs.
+ */
+void blk_quiesce_queue_timeout(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	synchronize_rcu_bh();
+	cancel_work_sync(&q->timeout_work);
+	/*
+	 * The timeout work will not be invoked anymore.
+	 */
+	del_timer_sync(&q->timeout);
+	/*
+	 * If the caller doesn't want the timer to be re-armed,
+	 * it has to stop/quiesce the queue.
+	 */
+}
+EXPORT_SYMBOL_GPL(blk_quiesce_queue_timeout);
+
+/*
+ * The caller must ensure all the outstanding IOs have been handled.
+ */
+void blk_unquiesce_queue_timeout(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	queue_flag_clear(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+EXPORT_SYMBOL_GPL(blk_unquiesce_queue_timeout);
 void blk_set_queue_dying(struct request_queue *q)
 {
 	spin_lock_irq(q->queue_lock);
@@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
 {
 	struct request_queue *q = from_timer(q, t, timeout);
 
+	if (blk_queue_timeout_quiesced(q))
+		return;
+
 	kblockd_schedule_work(&q->timeout_work);
 }


* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-25  8:52   ` jianchao.wang
@ 2018-01-25 10:15     ` Ming Lei
  -1 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-01-25 10:15 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Christoph Hellwig, Keith Busch, stable, Sagi Grimberg,
	linux-nvme, Xiao Liang

Hi Jianchao,

On Thu, Jan 25, 2018 at 04:52:31PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your patch and detailed comment.
> 
> On 01/25/2018 04:10 PM, Ming Lei wrote:
> > If request is marked as NVME_REQ_CANCELLED, we don't need to retry for
> > requeuing it, and it should be completed immediately. Even simply from
> > the flag name, it needn't to be requeued.
> > 
> > Otherwise, it is easy to cause IO hang when IO is timed out in case of
> > PCI NVMe:
> > 
> > 1) IO timeout is triggered, and nvme_timeout() tries to disable
> > device(nvme_dev_disable) and reset controller(nvme_reset_ctrl)
> > 
> > 2) inside nvme_dev_disable(), queue is frozen and quiesced, and
> > try to cancel every request, but the timeout request can't be canceled
> > since it is completed by __blk_mq_complete_request() in blk_mq_rq_timed_out().
> > 
> > 3) this timeout req is requeued via nvme_complete_rq(), but can't be
> > dispatched at all because queue is quiesced and hardware isn't ready,
> > finally nvme_wait_freeze() waits for ever in nvme_reset_work().
> 
> The queue will be unquiesced by nvme_start_queues() before start to wait freeze.

Hmm, I just missed this point, so there is no such IO hang in this case.

> 
> nvme_reset_work
> ....
> 		nvme_start_queues(&dev->ctrl);
> 		nvme_wait_freeze(&dev->ctrl);
> 		/* hit this only when allocate tagset fails */
> 		if (nvme_dev_add(dev))
> 			new_state = NVME_CTRL_ADMIN_ONLY;
> 		nvme_unfreeze(&dev->ctrl);
> ....
> And the relationship between nvme_timeout and nvme_dev_disable is really complicated.
> The nvme_timeout may run with nvme_dev_disable in parallel and also could invoke it.
> nvme_dev_disbale may sleep holding shutdown_lock, nvme_timeout will try to get it when invokes nvme_dev_disable....
> 

Yes, the implementation is really complicated, but nvme_dev_disable() is
run exclusively because of dev->shutdown_lock. Also, nvme_timeout() only
handles the timed-out request, which is invisible to nvme_dev_disable().

So in concept it looks fine, :-)

> On the other hand, can we provide a helper interface to quiesce the request_work timeout_work ?

Could you explain a bit why it is required?


> Such as following:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b888175..60469cd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -616,6 +616,44 @@ void blk_queue_bypass_end(struct request_queue *q)
>  }
>  EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
>  
> +/*
> + * When returns, the timeout_work will be invoked any more.
> + * The caller must take the charge of the outstanding IOs.
> + */
> +void blk_quiesce_queue_timeout(struct request_queue *q)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	synchronize_rcu_bh();
> +	cancel_work_sync(&q->timeout_work);
> +	/*
> +	 * The timeout work will not be invoked anymore.
> +	 */
> +	del_timer_sync(&q->timeout);
> +	/*
> +	 * If the caller doesn't want the timer to be re-armed,
> +	 * it has to stop/quiesce the queue.
> +	 */
> +}
> +EXPORT_SYMBOL_GPL(blk_quiesce_queue_timeout);
> +
> +/*
> + * Caller must ensure all the outstanding IOs has been handled.
> + */
> +void blk_unquiesce_queue_timeout(struct request_queue *q)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	queue_flag_clear(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +}
> +
> +EXPORT_SYMBOL_GPL(blk_unquiesce_queue_timeout);
>  void blk_set_queue_dying(struct request_queue *q)
>  {
>  	spin_lock_irq(q->queue_lock);
> @@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
>  {
>  	struct request_queue *q = from_timer(q, t, timeout);
>  
> +	if (blk_queue_timeout_quiesced(q))
> +		return;
> +
>  	kblockd_schedule_work(&q->timeout_work);
>  }
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

-- 
Ming


* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-25 10:15     ` Ming Lei
@ 2018-01-27 12:33       ` jianchao.wang
  -1 siblings, 0 replies; 16+ messages in thread
From: jianchao.wang @ 2018-01-27 12:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, stable, Sagi Grimberg,
	linux-nvme, Xiao Liang

Hi Ming,

Thanks for your kind response, and sorry for the delayed feedback.

On 01/25/2018 06:15 PM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Thu, Jan 25, 2018 at 04:52:31PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your patch and detailed comment.
>>
>> On 01/25/2018 04:10 PM, Ming Lei wrote:
>>> If request is marked as NVME_REQ_CANCELLED, we don't need to retry for
>>> requeuing it, and it should be completed immediately. Even simply from
>>> the flag name, it needn't to be requeued.
>>>
>>> Otherwise, it is easy to cause IO hang when IO is timed out in case of
>>> PCI NVMe:
>>>
>>> 1) IO timeout is triggered, and nvme_timeout() tries to disable
>>> device(nvme_dev_disable) and reset controller(nvme_reset_ctrl)
>>>
>>> 2) inside nvme_dev_disable(), queue is frozen and quiesced, and
>>> try to cancel every request, but the timeout request can't be canceled
>>> since it is completed by __blk_mq_complete_request() in blk_mq_rq_timed_out().
>>>
>>> 3) this timeout req is requeued via nvme_complete_rq(), but can't be
>>> dispatched at all because queue is quiesced and hardware isn't ready,
>>> finally nvme_wait_freeze() waits for ever in nvme_reset_work().
>>
>> The queue will be unquiesced by nvme_start_queues() before start to wait freeze.
> 
> Hammmm, just miss this point, so no such io hang in this case.
> 
>>
>> nvme_reset_work
>> ....
>> 		nvme_start_queues(&dev->ctrl);
>> 		nvme_wait_freeze(&dev->ctrl);
>> 		/* hit this only when allocate tagset fails */
>> 		if (nvme_dev_add(dev))
>> 			new_state = NVME_CTRL_ADMIN_ONLY;
>> 		nvme_unfreeze(&dev->ctrl);
>> ....
>> And the relationship between nvme_timeout and nvme_dev_disable is really complicated.
>> The nvme_timeout may run with nvme_dev_disable in parallel and also could invoke it.
>> nvme_dev_disbale may sleep holding shutdown_lock, nvme_timeout will try to get it when invokes nvme_dev_disable....
>>
> 
> Yes, the implementation is really complicated, but nvme_dev_disable() is
> run exclusively because of dev->shutdown_lock. Also nvme_timeout() only
> handles the timeout request which is invisible to nvme_dev_disable()
> 
> So in concept, looks it is fine, :-)

In fact, I have encountered a scenario similar to the description above:
nvme_dev_disable issued sq/cq delete commands on the admin queue, but the
nvme device didn't respond for some unknown reason. At that moment, the
earlier IO requests expired, and nvme_timeout was invoked while ctrl->state
was RESETTING, which then invoked nvme_dev_disable. So we got this scenario:

context A                                   context B
nvme_dev_disable                            nvme_timeout
  hold shutdown_lock                          nvme_dev_disable
    wait sq/cq delete command response          require shutdown_lock 

Finally, context A broke out due to a timeout, and things went OK.
But this is very dangerous.

There are other cases where a context may sleep with shutdown_lock held in
nvme_dev_disable.

>> On the other hand, can we provide a helper interface to quiesce the request_work timeout_work ?
> 
> Could you explain a bit why it is required?
nvme_dev_disable needs to cancel all the previously outstanding requests,
and it cannot wait for them to complete, because the device may have gone
wrong at that moment. But nvme_dev_disable may run in parallel with
nvme_timeout, or race with it. The best way to fix this is to ensure the
timeout path has completed before cancelling the previously outstanding
requests. (Just ignore the case where nvme_timeout invokes
nvme_dev_disable; that has to be fixed another way.)

The following patch does this: quiesce the request_queue->timeout_work
entirely, and then we no longer need to race with the nvme_timeout path in
the above scenario.

> 
> 
>> Such as following:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index b888175..60469cd 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -616,6 +616,44 @@ void blk_queue_bypass_end(struct request_queue *q)
>>  }
>>  EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
>>  
>> +/*
>> + * When returns, the timeout_work will be invoked any more.
>> + * The caller must take the charge of the outstanding IOs.
>> + */
>> +void blk_quiesce_queue_timeout(struct request_queue *q)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(q->queue_lock, flags);
>> +	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
>> +	spin_unlock_irqrestore(q->queue_lock, flags);
>> +
>> +	synchronize_rcu_bh();
>> +	cancel_work_sync(&q->timeout_work);
>> +	/*
>> +	 * The timeout work will not be invoked anymore.
>> +	 */
>> +	del_timer_sync(&q->timeout);
>> +	/*
>> +	 * If the caller doesn't want the timer to be re-armed,
>> +	 * it has to stop/quiesce the queue.
>> +	 */
>> +}
>> +EXPORT_SYMBOL_GPL(blk_quiesce_queue_timeout);
>> +
>> +/*
>> + * Caller must ensure all the outstanding IOs has been handled.
>> + */
>> +void blk_unquiesce_queue_timeout(struct request_queue *q)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(q->queue_lock, flags);
>> +	queue_flag_clear(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
>> +	spin_unlock_irqrestore(q->queue_lock, flags);
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(blk_unquiesce_queue_timeout);
>>  void blk_set_queue_dying(struct request_queue *q)
>>  {
>>  	spin_lock_irq(q->queue_lock);
>> @@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
>>  {
>>  	struct request_queue *q = from_timer(q, t, timeout);
>>  
>> +	if (blk_queue_timeout_quiesced(q))
>> +		return;
>> +
>>  	kblockd_schedule_work(&q->timeout_work);
>>  }
>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
@ 2018-01-27 12:33       ` jianchao.wang
  0 siblings, 0 replies; 16+ messages in thread
From: jianchao.wang @ 2018-01-27 12:33 UTC (permalink / raw)


Hi ming

Thanks for your kindly response.
And really sorry for delayed feedback here.

On 01/25/2018 06:15 PM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Thu, Jan 25, 2018@04:52:31PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your patch and detailed comment.
>>
>> On 01/25/2018 04:10 PM, Ming Lei wrote:
>>> If request is marked as NVME_REQ_CANCELLED, we don't need to retry for
>>> requeuing it, and it should be completed immediately. Even simply from
>>> the flag name, it needn't to be requeued.
>>>
>>> Otherwise, it is easy to cause IO hang when IO is timed out in case of
>>> PCI NVMe:
>>>
>>> 1) IO timeout is triggered, and nvme_timeout() tries to disable
>>> device(nvme_dev_disable) and reset controller(nvme_reset_ctrl)
>>>
>>> 2) inside nvme_dev_disable(), queue is frozen and quiesced, and
>>> try to cancel every request, but the timeout request can't be canceled
>>> since it is completed by __blk_mq_complete_request() in blk_mq_rq_timed_out().
>>>
>>> 3) this timeout req is requeued via nvme_complete_rq(), but can't be
>>> dispatched at all because queue is quiesced and hardware isn't ready,
>>> finally nvme_wait_freeze() waits for ever in nvme_reset_work().
>>
>> The queue will be unquiesced by nvme_start_queues() before start to wait freeze.
> 
> Hammmm, just miss this point, so no such io hang in this case.
> 
>>
>> nvme_reset_work
>> ....
>> 		nvme_start_queues(&dev->ctrl);
>> 		nvme_wait_freeze(&dev->ctrl);
>> 		/* hit this only when allocate tagset fails */
>> 		if (nvme_dev_add(dev))
>> 			new_state = NVME_CTRL_ADMIN_ONLY;
>> 		nvme_unfreeze(&dev->ctrl);
>> ....
>> And the relationship between nvme_timeout and nvme_dev_disable is really complicated.
>> The nvme_timeout may run with nvme_dev_disable in parallel and also could invoke it.
>> nvme_dev_disbale may sleep holding shutdown_lock, nvme_timeout will try to get it when invokes nvme_dev_disable....
>>
> 
> Yes, the implementation is really complicated, but nvme_dev_disable() is
> run exclusively because of dev->shutdown_lock. Also nvme_timeout() only
> handles the timeout request which is invisible to nvme_dev_disable()
> 
> So in concept, looks it is fine, :-)

In fact, I have incurred a scenario which is similar with description as above.
nvme_dev_disable issue sq/cq delete command on amdinq, but nvme device didn't response due to some unknown reason.
At the moment, the previous IO requests expired, and nvne_timeout is invoked when ctrl->state is RESETTING, then
nvme_dev_disable was invoked. So got a scenario:

context A                                   context B
nvme_dev_disable                            nvme_timeout
  hold shutdown_lock                          nvme_dev_disable
    wait sq/cq delete command response          require shutdown_lock 

Finally, the context A broke out due to timeout, things went ok.
But this is a very dangerous.
    
There are other cases where context may sleep with shutdown_lock held in nvme_dev_disable

>> On the other hand, can we provide a helper interface to quiesce the request_work timeout_work ?
> 
> Could you explain a bit why it is required?
For nvme_dev_disable, it need to cancel all the previously outstanding requests.
And it cannot wait them to be completed, because the device may goes wrong at the moment.
But nvme_dev_disable may run with nvme_timeout in parallel or race with it.
The best way to fix this is to ensure the timeout path has been completed before cancel the
previously outstanding requests. (Just ignore the case where the nvme_timeout will invoke nvme_dev_disable,
it has to be fixed by other way.)

Following path to is do this. Quiesce the request_queue->timeout_work totally, then we don't need to race with
nvme_timeout path any more in above scenario.

> 
> 
>> Such as following:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index b888175..60469cd 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -616,6 +616,44 @@ void blk_queue_bypass_end(struct request_queue *q)
>>  }
>>  EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
>>  
>> +/*
>> + * When returns, the timeout_work will be invoked any more.
>> + * The caller must take the charge of the outstanding IOs.
>> + */
>> +void blk_quiesce_queue_timeout(struct request_queue *q)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(q->queue_lock, flags);
>> +	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
>> +	spin_unlock_irqrestore(q->queue_lock, flags);
>> +
>> +	synchronize_rcu_bh();
>> +	cancel_work_sync(&q->timeout_work);
>> +	/*
>> +	 * The timeout work will not be invoked anymore.
>> +	 */
>> +	del_timer_sync(&q->timeout);
>> +	/*
>> +	 * If the caller doesn't want the timer to be re-armed,
>> +	 * it has to stop/quiesce the queue.
>> +	 */
>> +}
>> +EXPORT_SYMBOL_GPL(blk_quiesce_queue_timeout);
>> +
>> +/*
>> + * Caller must ensure all the outstanding IOs has been handled.
>> + */
>> +void blk_unquiesce_queue_timeout(struct request_queue *q)
>> +{
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(q->queue_lock, flags);
>> +	queue_flag_clear(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
>> +	spin_unlock_irqrestore(q->queue_lock, flags);
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(blk_unquiesce_queue_timeout);
>>  void blk_set_queue_dying(struct request_queue *q)
>>  {
>>  	spin_lock_irq(q->queue_lock);
>> @@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
>>  {
>>  	struct request_queue *q = from_timer(q, t, timeout);
>>  
>> +	if (blk_queue_timeout_quiesced(q))
>> +		return;
>> +
>>  	kblockd_schedule_work(&q->timeout_work);
>>  }
>>
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-27 12:33       ` jianchao.wang
@ 2018-01-27 13:31         ` Ming Lei
  -1 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-01-27 13:31 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Christoph Hellwig, Keith Busch, stable, Sagi Grimberg,
	linux-nvme, Xiao Liang

Hi Jianchao,

On Sat, Jan 27, 2018 at 08:33:35PM +0800, jianchao.wang wrote:
> Hi ming
> 
> Thanks for your kindly response.
> And really sorry for delayed feedback here.
> 
> On 01/25/2018 06:15 PM, Ming Lei wrote:
> > Hi Jianchao,
> > 
> > On Thu, Jan 25, 2018 at 04:52:31PM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> Thanks for your patch and detailed comment.
> >>
> >> On 01/25/2018 04:10 PM, Ming Lei wrote:
> >>> If request is marked as NVME_REQ_CANCELLED, we don't need to retry for
> >>> requeuing it, and it should be completed immediately. Even simply from
> >>> the flag name, it needn't to be requeued.
> >>>
> >>> Otherwise, it is easy to cause IO hang when IO is timed out in case of
> >>> PCI NVMe:
> >>>
> >>> 1) IO timeout is triggered, and nvme_timeout() tries to disable
> >>> device(nvme_dev_disable) and reset controller(nvme_reset_ctrl)
> >>>
> >>> 2) inside nvme_dev_disable(), queue is frozen and quiesced, and
> >>> try to cancel every request, but the timeout request can't be canceled
> >>> since it is completed by __blk_mq_complete_request() in blk_mq_rq_timed_out().
> >>>
> >>> 3) this timeout req is requeued via nvme_complete_rq(), but can't be
> >>> dispatched at all because queue is quiesced and hardware isn't ready,
> >>> finally nvme_wait_freeze() waits for ever in nvme_reset_work().
> >>
> >> The queue will be unquiesced by nvme_start_queues() before start to wait freeze.
> > 
> > Hammmm, just miss this point, so no such io hang in this case.
> > 
> >>
> >> nvme_reset_work
> >> ....
> >> 		nvme_start_queues(&dev->ctrl);
> >> 		nvme_wait_freeze(&dev->ctrl);
> >> 		/* hit this only when allocate tagset fails */
> >> 		if (nvme_dev_add(dev))
> >> 			new_state = NVME_CTRL_ADMIN_ONLY;
> >> 		nvme_unfreeze(&dev->ctrl);
> >> ....
> >> And the relationship between nvme_timeout and nvme_dev_disable is really complicated.
> >> nvme_timeout may run in parallel with nvme_dev_disable and can also invoke it.
> >> nvme_dev_disable may sleep holding shutdown_lock, and nvme_timeout will try to take it when it invokes nvme_dev_disable....
> >>
> > 
> > Yes, the implementation is really complicated, but nvme_dev_disable() is
> > run exclusively because of dev->shutdown_lock. Also nvme_timeout() only
> > handles the timeout request which is invisible to nvme_dev_disable()
> > 
> > So in concept, looks it is fine, :-)
> 
> In fact, I have run into a scenario similar to the description above.
> nvme_dev_disable issued sq/cq delete commands on the admin queue, but the nvme device didn't respond for some unknown reason.
> At that moment, the previous IO requests expired, and nvme_timeout was invoked while ctrl->state was RESETTING, and then
> nvme_dev_disable was invoked again. So we get this scenario:
> 
> context A                                   context B
> nvme_dev_disable                            nvme_timeout
>   hold shutdown_lock                          nvme_dev_disable
>     wait sq/cq delete command response          require shutdown_lock 
> 
> Finally, context A broke out due to a timeout and things went OK,
> but this is very dangerous.

I guess that is exactly why nvme_delete_queue() submits its request via
blk_execute_rq_nowait() and wait_for_completion_io_timeout() is called
in nvme_disable_io_queues(). Maybe it is better to add a comment on that;
any request submitted inside nvme_dev_disable() should be issued the
same way.

>     
> There are other cases where context may sleep with shutdown_lock held in nvme_dev_disable
> 
> >> On the other hand, can we provide a helper interface to quiesce the request_work timeout_work ?
> > 
> > Could you explain a bit why it is required?
> For nvme_dev_disable, it needs to cancel all the previously outstanding requests,
> and it cannot wait for them to complete, because the device may have gone wrong at that moment.

Yes, that is why blk_mq_tagset_busy_iter(nvme_cancel_request) is called:
it moves all in-flight requests back to the IDLE state and prepares them
for requeue, because the hardware is supposed to be DEAD.

So it doesn't depend on the hardware to complete these requests; once the
controller is reset successfully, they will be submitted to the hardware
again via requeue, and data loss is avoided.

> But nvme_dev_disable may run with nvme_timeout in parallel or race with it.

But that doesn't mean it is a race: blk_mq_complete_request() can avoid the
race between timeout and other completions, such as cancel.

> The best way to fix this is to ensure the timeout path has been completed before cancelling the
> previously outstanding requests. (Just ignore the case where nvme_timeout invokes nvme_dev_disable;
> that has to be fixed another way.)

Your approach may look a bit cleaner and simplify the implementation, but
it doesn't seem necessary.

So could you explain a bit more what exact issue you are worrying about? A
deadlock, or something else?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-27 13:31         ` Ming Lei
@ 2018-01-27 14:29           ` jianchao.wang
  -1 siblings, 0 replies; 16+ messages in thread
From: jianchao.wang @ 2018-01-27 14:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, stable, Sagi Grimberg,
	linux-nvme, Xiao Liang

Hi ming 

Thanks for your detailed response.
That's really appreciated.

On 01/27/2018 09:31 PM, Ming Lei wrote:
>> But nvme_dev_disable may run with nvme_timeout in parallel or race with it.
> But that doesn't mean it is a race, blk_mq_complete_request() can avoid race
> between timeout and other completions, such as cancel.
Yes, I know blk_mq_complete_request can prevent a request from being accessed by the timeout
path and another completion path concurrently. :)
What I worry about is that the timeout path could hold the expired request, so when
nvme_dev_disable returns, we cannot ensure all the previously outstanding requests have been
handled. That's really bad.

>> The best way to fix this is to ensure the timeout path has been completed before cancel the
>> previously outstanding requests. (Just ignore the case where the nvme_timeout will invoke nvme_dev_disable,
>> it has to be fixed by other way.)
> Maybe your approach looks a bit clean and simplify the implementation, but seems
> it isn't necessary.
> 
> So could you explain a bit what the exact issue you are worrying about? deadlock?
> or others?
There is indeed a potential issue, but it is in a very narrow window.
Please refer to https://lkml.org/lkml/2018/1/19/68

As you said, the approach looks a bit cleaner and simplifies the implementation.
That's what I really want: to break the complicated relationship between
nvme_timeout and nvme_dev_disable.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-27 14:29           ` jianchao.wang
@ 2018-01-27 15:44             ` Ming Lei
  -1 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2018-01-27 15:44 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Christoph Hellwig, Keith Busch, stable, Sagi Grimberg,
	linux-nvme, Xiao Liang

Hi jianchao,

On Sat, Jan 27, 2018 at 10:29:30PM +0800, jianchao.wang wrote:
> Hi ming 
> 
> Thanks for your detailed response.
> That's really appreciated.
> 
> On 01/27/2018 09:31 PM, Ming Lei wrote:
> >> But nvme_dev_disable may run with nvme_timeout in parallel or race with it.
> > But that doesn't mean it is a race, blk_mq_complete_request() can avoid race
> > between timeout and other completions, such as cancel.
> Yes, I know blk_mq_complete_request can prevent a request from being accessed by the timeout
> path and another completion path concurrently. :)
> What I worry about is that the timeout path could hold the expired request, so when
> nvme_dev_disable returns, we cannot ensure all the previously outstanding requests have been
> handled. That's really bad. 
> 
> >> The best way to fix this is to ensure the timeout path has been completed before cancel the
> >> previously outstanding requests. (Just ignore the case where the nvme_timeout will invoke nvme_dev_disable,
> >> it has to be fixed by other way.)
> > Maybe your approach looks a bit clean and simplify the implementation, but seems
> > it isn't necessary.
> > 
> > So could you explain a bit what the exact issue you are worrying about? deadlock?
> > or others?
> There is indeed potential issue. But it is in very narrow window.
> Please refer to https://lkml.org/lkml/2018/1/19/68

OK, here is the description from the above link:

>Yes, once controller disabling completes, any started requests will be
>handled and cannot expire. But before the _boundary_, there could be a
>nvme_timeout context runs with nvme_dev_disable in parallel. If a timeout
>path grabs a request, then nvme_dev_disable cannot get and cancel it.
>
>So even though the nvme_dev_disable completes, there still could be a request in nvme_timeout context.
>
>The worst case is :
>nvme_timeout                              nvme_reset_work
>if (ctrl->state == RESETTING )              nvme_dev_disable
>    nvme_dev_disable                        initializing procedure
>
>the nvme_dev_disable run with reinit procedure in nvme_reset_work in parallel.

OK, that is the issue: the nvme_dev_disable() in nvme_timeout() may
disable queues again and cause a hang in nvme_reset_work().

It looks like Keith's suggestion of introducing nvme_sync_queues() should be
enough to kill the race, but it seems nvme_sync_queues() should be
called at the entry of nvme_reset_work() unconditionally.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
  2018-01-27 15:44             ` Ming Lei
@ 2018-01-28  9:01               ` jianchao.wang
  -1 siblings, 0 replies; 16+ messages in thread
From: jianchao.wang @ 2018-01-28  9:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Xiao Liang, linux-nvme, Keith Busch, stable,
	Christoph Hellwig

Hi ming

Thanks for your time to look at this.
That's really appreciated.

On 01/27/2018 11:44 PM, Ming Lei wrote:
>>>> The best way to fix this is to ensure the timeout path has been completed before cancel the
>>>> previously outstanding requests. (Just ignore the case where the nvme_timeout will invoke nvme_dev_disable,
>>>> it has to be fixed by other way.)
>>> Maybe your approach looks a bit clean and simplify the implementation, but seems
>>> it isn't necessary.
>>>
>>> So could you explain a bit what the exact issue you are worrying about? deadlock?
>>> or others?
>> There is indeed potential issue. But it is in very narrow window.
>> Please refer to https://lkml.org/lkml/2018/1/19/68
> OK, follows description from the above link:
> 
>> Yes, once controller disabling completes, any started requests will be
>> handled and cannot expire. But before the _boundary_, there could be a
>> nvme_timeout context runs with nvme_dev_disable in parallel. If a timeout
>> path grabs a request, then nvme_dev_disable cannot get and cancel it.
>>
>> So even though the nvme_dev_disable completes, there still could be a request in nvme_timeout context.
>>
>> The worst case is :
>> nvme_timeout                              nvme_reset_work
>> if (ctrl->state == RESETTING )              nvme_dev_disable
>>    nvme_dev_disable                        initializing procedure
>>
>> the nvme_dev_disable run with reinit procedure in nvme_reset_work in parallel.
> OK, that is the issue, the nvme_dev_disable() in nvme_timeout() may
> disable queues again, and cause hang in nvme_reset_work().

Yes. :)

> 
> Looks Keith's suggestion of introducing nvme_sync_queues() should be
> enough to kill the race, but seems nvme_sync_queues() should have been
> called at the entry of nvme_reset_work() unconditionally.

Thanks for your suggestion.

Jianchao

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-01-28  9:02 UTC | newest]

Thread overview: 16+ messages
2018-01-25  8:10 [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED Ming Lei
2018-01-25  8:52 ` jianchao.wang
2018-01-25 10:15   ` Ming Lei
2018-01-27 12:33     ` jianchao.wang
2018-01-27 13:31       ` Ming Lei
2018-01-27 14:29         ` jianchao.wang
2018-01-27 15:44           ` Ming Lei
2018-01-28  9:01             ` jianchao.wang
