Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler
@ 2020-05-20 11:56 Ming Lei
  2020-05-20 11:56 ` [PATCH 1/3] blk-mq: add API of blk_mq_queue_frozen Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-20 11:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Ming Lei

Hi,

For nvme-pci, after controller is recovered, in-flight IOs are waited
before updating nr hw queues. If new controller error happens during
this period, nvme-pci driver deletes the controller and fails in-flight
IO. This way is too violent, and not friendly from user viewpoint.

Add APIs for checking if queue is frozen, and replace nvme_wait_freeze
in nvme-pci reset handler with checking if all ns queues are frozen &
controller disabled. Then a fresh new reset can be scheduled for
handling new controller error during waiting for in-flight IO completion.

So deleting controller & failing IOs can be avoided in this situation.

Without this patches, when fail io timeout injection is run, the
controller can be removed very quickly. With this patch, no controller
removing can be observed, and controller can recover to normal state
after stopping to inject io timeout failure.

Ming Lei (3):
  blk-mq: add API of blk_mq_queue_frozen
  nvme: add nvme_frozen
  nvme-pci: make nvme reset more reliable

 block/blk-mq.c           |  6 ++++++
 drivers/nvme/host/core.c | 14 ++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 37 ++++++++++++++++++++++++++++++-------
 include/linux/blk-mq.h   |  1 +
 5 files changed, 52 insertions(+), 7 deletions(-)

-- 
2.25.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] blk-mq: add API of blk_mq_queue_frozen
  2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
@ 2020-05-20 11:56 ` Ming Lei
  2020-05-20 11:56 ` [PATCH 2/3] nvme: add nvme_frozen Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-20 11:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Ming Lei, Sagi Grimberg, Keith Busch, Max Gurtovoy

blk_mq_freeze_queue_wait() isn't very flexible for some case, such as
error recovery: when blk_mq_freeze_queue_wait is called in error
recovery handler, new problem may be triggered on this controller, so
in-flight IO may not complete when blk_mq_freeze_queue_wait() is called.

And error recovery is often run in single context, so dead lock is
triggered, because error recover handler can't move on.

Add one new API of blk_mq_queue_frozen(), error recovery handler may
use this helper to query if the queue has been frozen completely.
Meantime, the error recovery handler can check if there is hardware
failure happened. If yes, error recovery handler can break from current
handling, and run a fresh new recovery, so deadlock can be avoided.

This API will be used to improve error handling of nvme-pci's timeout
handler.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 6 ++++++
 include/linux/blk-mq.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index cac11945f602..e595951bcdae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -148,6 +148,12 @@ void blk_mq_freeze_queue_wait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
+bool blk_mq_queue_frozen(struct request_queue *q)
+{
+	return percpu_ref_is_zero(&q->q_usage_counter);
+}
+EXPORT_SYMBOL_GPL(blk_mq_queue_frozen);
+
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a..e1d57202d526 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -518,6 +518,7 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
+bool blk_mq_queue_frozen(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.25.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] nvme: add nvme_frozen
  2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
  2020-05-20 11:56 ` [PATCH 1/3] blk-mq: add API of blk_mq_queue_frozen Ming Lei
@ 2020-05-20 11:56 ` Ming Lei
  2020-05-20 11:56 ` [PATCH 3/3] nvme-pci: make nvme reset more reliable Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-20 11:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Ming Lei, Sagi Grimberg, Keith Busch, Max Gurtovoy

Add one new API of nvme_frozen(), reset handler may use this helper to
query if all ns queues have been frozen completely. Meantime, the reset
handler can check if there is new hardware failure happened. If yes, reset
handler can break from current handling, and schedule a fresh new recovery,
so deadlock or deleting controller & fail all IOs can be avoided.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 14 ++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3c037f5a9ba..469010607383 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4243,6 +4243,20 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
 
+bool nvme_frozen(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+	int ret = 0;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		ret += !blk_mq_queue_frozen(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+
+	return ret == 0;
+}
+EXPORT_SYMBOL_GPL(nvme_frozen);
+
 void nvme_start_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2e04a36296d9..459e5952ff5f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -508,6 +508,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
+bool nvme_frozen(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
-- 
2.25.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
  2020-05-20 11:56 ` [PATCH 1/3] blk-mq: add API of blk_mq_queue_frozen Ming Lei
  2020-05-20 11:56 ` [PATCH 2/3] nvme: add nvme_frozen Ming Lei
@ 2020-05-20 11:56 ` Ming Lei
  2020-05-20 17:10   ` Dongli Zhang
  2020-05-26  5:01   ` Dongli Zhang
  2020-05-26  2:55 ` [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
  2020-05-27 18:09 ` Alan Adamson
  4 siblings, 2 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-20 11:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Ming Lei, Sagi Grimberg, Keith Busch, Max Gurtovoy

During waiting for in-flight IO completion in reset handler, timeout
or controller failure still may happen, then the controller is deleted
and all inflight IOs are failed. This way is too violent.

Improve the reset handling by replacing nvme_wait_freeze with query
& check controller. If all ns queues are frozen, the controller is reset
successfully, otherwise check and see if the controller has been disabled.
If yes, break from the current recovery and schedule a fresh new reset.

This way avoids to failing IO & removing controller unnecessarily.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ce0d1e79467a..b5aeed33a634 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/delay.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 * shutdown, so we return BLK_EH_DONE.
 	 */
 	switch (dev->ctrl.state) {
-	case NVME_CTRL_CONNECTING:
-		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
-		/* fall through */
 	case NVME_CTRL_DELETING:
 		dev_warn_ratelimited(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
@@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+		    dev->ctrl.state == NVME_CTRL_RESETTING ||
+		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
 			freeze = true;
 			nvme_start_freeze(&dev->ctrl);
 		}
@@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
+static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
+{
+	bool frozen;
+
+	while (true) {
+		frozen = nvme_frozen(&dev->ctrl);
+		if (frozen)
+			break;
+		if (!dev->online_queues)
+			break;
+		msleep(5);
+	}
+
+	return frozen;
+}
+
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev =
 		container_of(work, struct nvme_dev, ctrl.reset_work);
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result;
+	bool reset_done = true;
 
 	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
 		result = -ENODEV;
@@ -2606,8 +2622,9 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_free_tagset(dev);
 	} else {
 		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
-		nvme_dev_add(dev);
+		reset_done = nvme_wait_freeze_and_check(dev);
+		if (reset_done)
+			nvme_dev_add(dev);
 		nvme_unfreeze(&dev->ctrl);
 	}
 
@@ -2622,7 +2639,13 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 	}
 
-	nvme_start_ctrl(&dev->ctrl);
+	/* New error happens during reset, so schedule a new reset */
+	if (!reset_done) {
+		dev_warn(dev->ctrl.device, "new error during reset\n");
+		nvme_reset_ctrl(&dev->ctrl);
+	} else {
+		nvme_start_ctrl(&dev->ctrl);
+	}
 	return;
 
  out_unlock:
-- 
2.25.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 11:56 ` [PATCH 3/3] nvme-pci: make nvme reset more reliable Ming Lei
@ 2020-05-20 17:10   ` Dongli Zhang
  2020-05-20 17:27     ` Dongli Zhang
                       ` (2 more replies)
  2020-05-26  5:01   ` Dongli Zhang
  1 sibling, 3 replies; 14+ messages in thread
From: Dongli Zhang @ 2020-05-20 17:10 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Sagi Grimberg, Keith Busch, Max Gurtovoy



On 5/20/20 4:56 AM, Ming Lei wrote:
> During waiting for in-flight IO completion in reset handler, timeout
> or controller failure still may happen, then the controller is deleted
> and all inflight IOs are failed. This way is too violent.
> 
> Improve the reset handling by replacing nvme_wait_freeze with query
> & check controller. If all ns queues are frozen, the controller is reset
> successfully, otherwise check and see if the controller has been disabled.
> If yes, break from the current recovery and schedule a fresh new reset.
> 
> This way avoids to failing IO & removing controller unnecessarily.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index ce0d1e79467a..b5aeed33a634 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -24,6 +24,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
>  #include <linux/pci-p2pdma.h>
> +#include <linux/delay.h>
>  
>  #include "trace.h"
>  #include "nvme.h"
> @@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 * shutdown, so we return BLK_EH_DONE.
>  	 */
>  	switch (dev->ctrl.state) {
> -	case NVME_CTRL_CONNECTING:
> -		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> -		/* fall through */
>  	case NVME_CTRL_DELETING:
>  		dev_warn_ratelimited(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, disable controller\n",
> @@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
>  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> +		    dev->ctrl.state == NVME_CTRL_RESETTING ||
> +		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
>  			freeze = true;
>  			nvme_start_freeze(&dev->ctrl);
>  		}
> @@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>  		nvme_put_ctrl(&dev->ctrl);
>  }
>  
> +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
> +{
> +	bool frozen;
> +
> +	while (true) {
> +		frozen = nvme_frozen(&dev->ctrl);
> +		if (frozen)
> +			break;
> +		if (!dev->online_queues)
> +			break;
> +		msleep(5);
> +	}
> +
> +	return frozen;
> +}
> +
>  static void nvme_reset_work(struct work_struct *work)
>  {
>  	struct nvme_dev *dev =
>  		container_of(work, struct nvme_dev, ctrl.reset_work);
>  	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
>  	int result;
> +	bool reset_done = true;
>  
>  	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
>  		result = -ENODEV;
> @@ -2606,8 +2622,9 @@ static void nvme_reset_work(struct work_struct *work)
>  		nvme_free_tagset(dev);
>  	} else {
>  		nvme_start_queues(&dev->ctrl);
> -		nvme_wait_freeze(&dev->ctrl);
> -		nvme_dev_add(dev);
> +		reset_done = nvme_wait_freeze_and_check(dev);

Once we arrive at here, it indicates "dev->online_queues >= 2".

2601         if (dev->online_queues < 2) {
2602                 dev_warn(dev->ctrl.device, "IO queues not created\n");
2603                 nvme_kill_queues(&dev->ctrl);
2604                 nvme_remove_namespaces(&dev->ctrl);
2605                 nvme_free_tagset(dev);
2606         } else {
2607                 nvme_start_queues(&dev->ctrl);
2608                 nvme_wait_freeze(&dev->ctrl);
2609                 nvme_dev_add(dev);
2610                 nvme_unfreeze(&dev->ctrl);
2611         }

Is there any reason to check "if (!dev->online_queues)" in
nvme_wait_freeze_and_check()?

Thank you very much!

Dongli Zhang




> +		if (reset_done)
> +			nvme_dev_add(dev);
>  		nvme_unfreeze(&dev->ctrl);
>  	}
>  
> @@ -2622,7 +2639,13 @@ static void nvme_reset_work(struct work_struct *work)
>  		goto out;
>  	}
>  
> -	nvme_start_ctrl(&dev->ctrl);
> +	/* New error happens during reset, so schedule a new reset */
> +	if (!reset_done) {
> +		dev_warn(dev->ctrl.device, "new error during reset\n");
> +		nvme_reset_ctrl(&dev->ctrl);
> +	} else {
> +		nvme_start_ctrl(&dev->ctrl);
> +	}
>  	return;
>  
>   out_unlock:
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 17:10   ` Dongli Zhang
@ 2020-05-20 17:27     ` Dongli Zhang
  2020-05-20 17:52     ` Keith Busch
  2020-05-21  2:33     ` Ming Lei
  2 siblings, 0 replies; 14+ messages in thread
From: Dongli Zhang @ 2020-05-20 17:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Sagi Grimberg, Keith Busch, Max Gurtovoy



On 5/20/20 10:10 AM, Dongli Zhang wrote:
> 
> 
> On 5/20/20 4:56 AM, Ming Lei wrote:
>> During waiting for in-flight IO completion in reset handler, timeout
>> or controller failure still may happen, then the controller is deleted
>> and all inflight IOs are failed. This way is too violent.
>>
>> Improve the reset handling by replacing nvme_wait_freeze with query
>> & check controller. If all ns queues are frozen, the controller is reset
>> successfully, otherwise check and see if the controller has been disabled.
>> If yes, break from the current recovery and schedule a fresh new reset.
>>
>> This way avoids to failing IO & removing controller unnecessarily.
>>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Sagi Grimberg <sagi@grimberg.me>
>> Cc: Keith Busch <kbusch@kernel.org>
>> Cc: Max Gurtovoy <maxg@mellanox.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> ---
>>  drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
>>  1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index ce0d1e79467a..b5aeed33a634 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -24,6 +24,7 @@
>>  #include <linux/io-64-nonatomic-lo-hi.h>
>>  #include <linux/sed-opal.h>
>>  #include <linux/pci-p2pdma.h>
>> +#include <linux/delay.h>
>>  
>>  #include "trace.h"
>>  #include "nvme.h"
>> @@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>>  	 * shutdown, so we return BLK_EH_DONE.
>>  	 */
>>  	switch (dev->ctrl.state) {
>> -	case NVME_CTRL_CONNECTING:
>> -		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>> -		/* fall through */
>>  	case NVME_CTRL_DELETING:
>>  		dev_warn_ratelimited(dev->ctrl.device,
>>  			 "I/O %d QID %d timeout, disable controller\n",
>> @@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>>  
>>  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
>> -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
>> +		    dev->ctrl.state == NVME_CTRL_RESETTING ||
>> +		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
>>  			freeze = true;
>>  			nvme_start_freeze(&dev->ctrl);
>>  		}
>> @@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>>  		nvme_put_ctrl(&dev->ctrl);
>>  }
>>  
>> +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
>> +{
>> +	bool frozen;
>> +
>> +	while (true) {
>> +		frozen = nvme_frozen(&dev->ctrl);
>> +		if (frozen)
>> +			break;
>> +		if (!dev->online_queues)
>> +			break;
>> +		msleep(5);
>> +	}
>> +
>> +	return frozen;
>> +}
>> +
>>  static void nvme_reset_work(struct work_struct *work)
>>  {
>>  	struct nvme_dev *dev =
>>  		container_of(work, struct nvme_dev, ctrl.reset_work);
>>  	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
>>  	int result;
>> +	bool reset_done = true;
>>  
>>  	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
>>  		result = -ENODEV;
>> @@ -2606,8 +2622,9 @@ static void nvme_reset_work(struct work_struct *work)
>>  		nvme_free_tagset(dev);
>>  	} else {
>>  		nvme_start_queues(&dev->ctrl);
>> -		nvme_wait_freeze(&dev->ctrl);
>> -		nvme_dev_add(dev);
>> +		reset_done = nvme_wait_freeze_and_check(dev);
> 
> Once we arrive at here, it indicates "dev->online_queues >= 2".
> 
> 2601         if (dev->online_queues < 2) {
> 2602                 dev_warn(dev->ctrl.device, "IO queues not created\n");
> 2603                 nvme_kill_queues(&dev->ctrl);
> 2604                 nvme_remove_namespaces(&dev->ctrl);
> 2605                 nvme_free_tagset(dev);
> 2606         } else {
> 2607                 nvme_start_queues(&dev->ctrl);
> 2608                 nvme_wait_freeze(&dev->ctrl);
> 2609                 nvme_dev_add(dev);
> 2610                 nvme_unfreeze(&dev->ctrl);
> 2611         }
> 
> Is there any reason to check "if (!dev->online_queues)" in
> nvme_wait_freeze_and_check()?
> 

I think you meant another nvme_dev_disable() during reset?

Sorry for the misunderstanding in previous email.

Dongli Zhang

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 17:10   ` Dongli Zhang
  2020-05-20 17:27     ` Dongli Zhang
@ 2020-05-20 17:52     ` Keith Busch
  2020-05-21  2:33     ` Ming Lei
  2 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2020-05-20 17:52 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Alan Adamson, Sagi Grimberg, Max Gurtovoy

On Wed, May 20, 2020 at 10:10:47AM -0700, Dongli Zhang wrote:
> On 5/20/20 4:56 AM, Ming Lei wrote:
> > +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
> > +{
> > +	bool frozen;
> > +
> > +	while (true) {
> > +		frozen = nvme_frozen(&dev->ctrl);
> > +		if (frozen)
> > +			break;
> > +		if (!dev->online_queues)
> > +			break;
> > +		msleep(5);
> > +	}
> > +
> > +	return frozen;
> > +}
> > +
> >  static void nvme_reset_work(struct work_struct *work)
> >  {
> >  	struct nvme_dev *dev =
> >  		container_of(work, struct nvme_dev, ctrl.reset_work);
> >  	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
> >  	int result;
> > +	bool reset_done = true;
> >  
> >  	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
> >  		result = -ENODEV;
> > @@ -2606,8 +2622,9 @@ static void nvme_reset_work(struct work_struct *work)
> >  		nvme_free_tagset(dev);
> >  	} else {
> >  		nvme_start_queues(&dev->ctrl);
> > -		nvme_wait_freeze(&dev->ctrl);
> > -		nvme_dev_add(dev);
> > +		reset_done = nvme_wait_freeze_and_check(dev);
> 
> Once we arrive at here, it indicates "dev->online_queues >= 2".
> 
> 2601         if (dev->online_queues < 2) {
> 2602                 dev_warn(dev->ctrl.device, "IO queues not created\n");
> 2603                 nvme_kill_queues(&dev->ctrl);
> 2604                 nvme_remove_namespaces(&dev->ctrl);
> 2605                 nvme_free_tagset(dev);
> 2606         } else {
> 2607                 nvme_start_queues(&dev->ctrl);
> 2608                 nvme_wait_freeze(&dev->ctrl);
> 2609                 nvme_dev_add(dev);
> 2610                 nvme_unfreeze(&dev->ctrl);
> 2611         }
> 
> Is there any reason to check "if (!dev->online_queues)" in
> nvme_wait_freeze_and_check()?

Looks correct to me. If the queues fail to freeze, that means a timeout
occured, and the nvme timeout handler tears down all online queues, so
this patch uses that for the criteria to break out of the loop. 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 17:10   ` Dongli Zhang
  2020-05-20 17:27     ` Dongli Zhang
  2020-05-20 17:52     ` Keith Busch
@ 2020-05-21  2:33     ` Ming Lei
  2 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-21  2:33 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Alan Adamson, Sagi Grimberg, Keith Busch, Max Gurtovoy

On Wed, May 20, 2020 at 10:10:47AM -0700, Dongli Zhang wrote:
> 
> 
> On 5/20/20 4:56 AM, Ming Lei wrote:
> > During waiting for in-flight IO completion in reset handler, timeout
> > or controller failure still may happen, then the controller is deleted
> > and all inflight IOs are failed. This way is too violent.
> > 
> > Improve the reset handling by replacing nvme_wait_freeze with query
> > & check controller. If all ns queues are frozen, the controller is reset
> > successfully, otherwise check and see if the controller has been disabled.
> > If yes, break from the current recovery and schedule a fresh new reset.
> > 
> > This way avoids to failing IO & removing controller unnecessarily.
> > 
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: Max Gurtovoy <maxg@mellanox.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
> >  1 file changed, 30 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index ce0d1e79467a..b5aeed33a634 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/sed-opal.h>
> >  #include <linux/pci-p2pdma.h>
> > +#include <linux/delay.h>
> >  
> >  #include "trace.h"
> >  #include "nvme.h"
> > @@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	 * shutdown, so we return BLK_EH_DONE.
> >  	 */
> >  	switch (dev->ctrl.state) {
> > -	case NVME_CTRL_CONNECTING:
> > -		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> > -		/* fall through */
> >  	case NVME_CTRL_DELETING:
> >  		dev_warn_ratelimited(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, disable controller\n",
> > @@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >  
> >  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> > -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> > +		    dev->ctrl.state == NVME_CTRL_RESETTING ||
> > +		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
> >  			freeze = true;
> >  			nvme_start_freeze(&dev->ctrl);
> >  		}
> > @@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
> >  		nvme_put_ctrl(&dev->ctrl);
> >  }
> >  
> > +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
> > +{
> > +	bool frozen;
> > +
> > +	while (true) {
> > +		frozen = nvme_frozen(&dev->ctrl);
> > +		if (frozen)
> > +			break;
> > +		if (!dev->online_queues)
> > +			break;
> > +		msleep(5);
> > +	}
> > +
> > +	return frozen;
> > +}
> > +
> >  static void nvme_reset_work(struct work_struct *work)
> >  {
> >  	struct nvme_dev *dev =
> >  		container_of(work, struct nvme_dev, ctrl.reset_work);
> >  	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
> >  	int result;
> > +	bool reset_done = true;
> >  
> >  	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) {
> >  		result = -ENODEV;
> > @@ -2606,8 +2622,9 @@ static void nvme_reset_work(struct work_struct *work)
> >  		nvme_free_tagset(dev);
> >  	} else {
> >  		nvme_start_queues(&dev->ctrl);
> > -		nvme_wait_freeze(&dev->ctrl);
> > -		nvme_dev_add(dev);
> > +		reset_done = nvme_wait_freeze_and_check(dev);
> 
> Once we arrive at here, it indicates "dev->online_queues >= 2".
> 
> 2601         if (dev->online_queues < 2) {
> 2602                 dev_warn(dev->ctrl.device, "IO queues not created\n");
> 2603                 nvme_kill_queues(&dev->ctrl);
> 2604                 nvme_remove_namespaces(&dev->ctrl);
> 2605                 nvme_free_tagset(dev);
> 2606         } else {
> 2607                 nvme_start_queues(&dev->ctrl);
> 2608                 nvme_wait_freeze(&dev->ctrl);
> 2609                 nvme_dev_add(dev);
> 2610                 nvme_unfreeze(&dev->ctrl);
> 2611         }
> 
> Is there any reason to check "if (!dev->online_queues)" in
> nvme_wait_freeze_and_check()?
> 

nvme_dev_disable() suspends all io queues and admin queue, so dev->online_queues
will become 0 after nvme_dev_disable() is run from timeout handler.


thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler
  2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
                   ` (2 preceding siblings ...)
  2020-05-20 11:56 ` [PATCH 3/3] nvme-pci: make nvme reset more reliable Ming Lei
@ 2020-05-26  2:55 ` Ming Lei
  2020-05-27 18:09 ` Alan Adamson
  4 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-26  2:55 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig; +Cc: Alan Adamson

On Wed, May 20, 2020 at 07:56:52PM +0800, Ming Lei wrote:
> Hi,
> 
> For nvme-pci, after controller is recovered, in-flight IOs are waited
> before updating nr hw queues. If new controller error happens during
> this period, nvme-pci driver deletes the controller and fails in-flight
> IO. This way is too violent, and not friendly from user viewpoint.
> 
> Add APIs for checking if queue is frozen, and replace nvme_wait_freeze
> in nvme-pci reset handler with checking if all ns queues are frozen &
> controller disabled. Then a fresh new reset can be scheduled for
> handling new controller error during waiting for in-flight IO completion.
> 
> So deleting controller & failing IOs can be avoided in this situation.
> 
> Without this patches, when fail io timeout injection is run, the
> controller can be removed very quickly. With this patch, no controller
> removing can be observed, and controller can recover to normal state
> after stopping to inject io timeout failure.
> 
> Ming Lei (3):
>   blk-mq: add API of blk_mq_queue_frozen
>   nvme: add nvme_frozen
>   nvme-pci: make nvme reset more reliable
> 
>  block/blk-mq.c           |  6 ++++++
>  drivers/nvme/host/core.c | 14 ++++++++++++++
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  | 37 ++++++++++++++++++++++++++++++-------
>  include/linux/blk-mq.h   |  1 +
>  5 files changed, 52 insertions(+), 7 deletions(-)
> 
> -- 
> 2.25.2
> 

Hello Guys,

Ping...

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-20 11:56 ` [PATCH 3/3] nvme-pci: make nvme reset more reliable Ming Lei
  2020-05-20 17:10   ` Dongli Zhang
@ 2020-05-26  5:01   ` Dongli Zhang
  2020-05-26  7:12     ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Dongli Zhang @ 2020-05-26  5:01 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Alan Adamson, Sagi Grimberg, Keith Busch, Max Gurtovoy



On 5/20/20 4:56 AM, Ming Lei wrote:
> During waiting for in-flight IO completion in reset handler, timeout

Does this indicate the window since nvme_start_queues() in nvme_reset_work(),
that is, just after the queues are unquiesced again?

If v2 is required in the future, how about to mention the specific function to
that it would be much more easier to track the issue.

> or controller failure still may happen, then the controller is deleted
> and all inflight IOs are failed. This way is too violent.
> 
> Improve the reset handling by replacing nvme_wait_freeze with query
> & check controller. If all ns queues are frozen, the controller is reset
> successfully, otherwise check and see if the controller has been disabled.
> If yes, break from the current recovery and schedule a fresh new reset.
> 
> This way avoids to failing IO & removing controller unnecessarily.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Max Gurtovoy <maxg@mellanox.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index ce0d1e79467a..b5aeed33a634 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -24,6 +24,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
>  #include <linux/pci-p2pdma.h>
> +#include <linux/delay.h>
>  
>  #include "trace.h"
>  #include "nvme.h"
> @@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 * shutdown, so we return BLK_EH_DONE.
>  	 */
>  	switch (dev->ctrl.state) {
> -	case NVME_CTRL_CONNECTING:
> -		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> -		/* fall through */
>  	case NVME_CTRL_DELETING:
>  		dev_warn_ratelimited(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, disable controller\n",
> @@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
>  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> +		    dev->ctrl.state == NVME_CTRL_RESETTING ||
> +		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
>  			freeze = true;
>  			nvme_start_freeze(&dev->ctrl);
>  		}
> @@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
>  		nvme_put_ctrl(&dev->ctrl);
>  }
>  
> +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
> +{
> +	bool frozen;
> +
> +	while (true) {
> +		frozen = nvme_frozen(&dev->ctrl);
> +		if (frozen)
> +			break;

... and how about to comment that the below is because of nvme timeout handler
as explained in another email (if v2 would be sent) so that it is not required
to query for "online_queues" with cscope :)

> +		if (!dev->online_queues)
> +			break;
> +		msleep(5);
> +	}
> +
> +	return frozen;
> +}
> +

Thank you very much!

Dongli Zhang

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] nvme-pci: make nvme reset more reliable
  2020-05-26  5:01   ` Dongli Zhang
@ 2020-05-26  7:12     ` Ming Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-26  7:12 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Alan Adamson, Sagi Grimberg, Keith Busch, Max Gurtovoy

On Mon, May 25, 2020 at 10:01:18PM -0700, Dongli Zhang wrote:
> 
> 
> On 5/20/20 4:56 AM, Ming Lei wrote:
> > During waiting for in-flight IO completion in reset handler, timeout
> 
> Does this indicate the window since nvme_start_queues() in nvme_reset_work(),
> that is, just after the queues are unquiesced again?

Right, nvme_start_queues() starts to dispatch requests again, and
nvme_wait_freeze() waits completion of all these in-flight IOs.

> 
> If v2 is required in the future, how about to mention the specific function to
> that it would be much more easier to track the issue.

Not sure it is needed, cause it is quite straightforward.

> 
> > or controller failure still may happen, then the controller is deleted
> > and all inflight IOs are failed. This way is too violent.
> > 
> > Improve the reset handling by replacing nvme_wait_freeze with query
> > & check controller. If all ns queues are frozen, the controller is reset
> > successfully, otherwise check and see if the controller has been disabled.
> > If yes, break from the current recovery and schedule a fresh new reset.
> > 
> > This way avoids to failing IO & removing controller unnecessarily.
> > 
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: Max Gurtovoy <maxg@mellanox.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++++++-------
> >  1 file changed, 30 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index ce0d1e79467a..b5aeed33a634 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> >  #include <linux/sed-opal.h>
> >  #include <linux/pci-p2pdma.h>
> > +#include <linux/delay.h>
> >  
> >  #include "trace.h"
> >  #include "nvme.h"
> > @@ -1235,9 +1236,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	 * shutdown, so we return BLK_EH_DONE.
> >  	 */
> >  	switch (dev->ctrl.state) {
> > -	case NVME_CTRL_CONNECTING:
> > -		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> > -		/* fall through */
> >  	case NVME_CTRL_DELETING:
> >  		dev_warn_ratelimited(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, disable controller\n",
> > @@ -2393,7 +2391,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >  
> >  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> > -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> > +		    dev->ctrl.state == NVME_CTRL_RESETTING ||
> > +		    dev->ctrl.state == NVME_CTRL_CONNECTING) {
> >  			freeze = true;
> >  			nvme_start_freeze(&dev->ctrl);
> >  		}
> > @@ -2504,12 +2503,29 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
> >  		nvme_put_ctrl(&dev->ctrl);
> >  }
> >  
> > +static bool nvme_wait_freeze_and_check(struct nvme_dev *dev)
> > +{
> > +	bool frozen;
> > +
> > +	while (true) {
> > +		frozen = nvme_frozen(&dev->ctrl);
> > +		if (frozen)
> > +			break;
> 
> ... and how about to comment that the below is because of nvme timeout handler
> as explained in another email (if v2 would be sent) so that it is not required
> to query for "online_queues" with cscope :)
> 
> > +		if (!dev->online_queues)
> > +			break;
> > +		msleep(5);

Fine.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler
  2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
                   ` (3 preceding siblings ...)
  2020-05-26  2:55 ` [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
@ 2020-05-27 18:09 ` Alan Adamson
  2020-05-27 18:52   ` Keith Busch
  2020-05-28  1:36   ` Ming Lei
  4 siblings, 2 replies; 14+ messages in thread
From: Alan Adamson @ 2020-05-27 18:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig

I tested this patch against a timeout test I've been working with and 
I'm getting a hang.

# cat block-err.sh
set -x
echo 100 > /sys/kernel/debug/fail_io_timeout/probability
echo 1000 > /sys/kernel/debug/fail_io_timeout/times
echo 1 > /sys/block/nvme0n1/io-timeout-fail
dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1


# sh  block-err.sh
+ echo 100
+ echo 1000
+ echo 1
+ dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1

**** Hang ****

# dmesg
.
.
.
[   79.403253] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 1, probability 100, space 
0, times 1000
[   79.403255] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.7.0-rc7+ #1
[   79.403256] Hardware name: Oracle Corporation ORACLE SERVER 
X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[   79.403257] Call Trace:
[   79.403259]  <IRQ>
[   79.403267]  dump_stack+0x6d/0x9a
[   79.403270]  should_fail.cold.5+0x32/0x42
[   79.403273]  blk_should_fake_timeout+0x26/0x30
[   79.403275]  blk_mq_complete_request+0x1b/0x120
[   79.403280]  nvme_irq+0xd9/0x1f0 [nvme]
[   79.403287]  __handle_irq_event_percpu+0x44/0x190
[   79.403288]  handle_irq_event_percpu+0x32/0x80
[   79.403290]  handle_irq_event+0x3b/0x5a
[   79.403291]  handle_edge_irq+0x87/0x190
[   79.403296]  do_IRQ+0x54/0xe0
[   79.403299]  common_interrupt+0xf/0xf
[   79.403300]  </IRQ>
[   79.403305] RIP: 0010:cpuidle_enter_state+0xc1/0x400
[   79.403307] Code: ff e8 e3 41 93 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 
00 00 f6 c4 02 0f 85 d2 02 00 00 31 ff e8 16 c3 99 ff fb 66 0f 1f 44 00 
00 <45> 85 e4 0f 88 3d 02 00 00 49 63 c4 48 8d 14 40 48 8d 0c c5 00 00
[   79.403308] RSP: 0018:ffffb97e8c54be40 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffffdd
[   79.403309] RAX: ffff9781bf76cc40 RBX: ffffd95e7f743200 RCX: 
000000000000001f
[   79.403310] RDX: 000000127ccd6e6c RSI: 0000000031573862 RDI: 
0000000000000000
[   79.403310] RBP: ffffb97e8c54be80 R08: 0000000000000002 R09: 
000000000002c4c0
[   79.403311] R10: 011b921e580bc454 R11: ffff9781bf76bb44 R12: 
0000000000000002
[   79.403311] R13: ffffffffbd14c120 R14: ffffffffbd14c208 R15: 
ffffffffbd14c1f0
[   79.403314]  cpuidle_enter+0x2e/0x40
[   79.403318]  call_cpuidle+0x23/0x40
[   79.403319]  do_idle+0x230/0x270
[   79.403320]  cpu_startup_entry+0x1d/0x20
[   79.403325]  start_secondary+0x170/0x1c0
[   79.403329]  secondary_startup_64+0xb6/0xc0
[  109.674334] nvme nvme0: I/O 754 QID 34 timeout, aborting
[  109.674395] nvme nvme0: Abort status: 0x0
[  139.879453] nvme nvme0: I/O 754 QID 34 timeout, reset controller
[  139.895263] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 1, probability 100, space 
0, times 999
[  139.895265] CPU: 5 PID: 2470 Comm: kworker/5:1H Not tainted 5.7.0-rc7+ #1
[  139.895266] Hardware name: Oracle Corporation ORACLE SERVER 
X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
[  139.895271] Workqueue: kblockd blk_mq_timeout_work
[  139.895272] Call Trace:
[  139.895279]  dump_stack+0x6d/0x9a
[  139.895281]  should_fail.cold.5+0x32/0x42
[  139.895282]  blk_should_fake_timeout+0x26/0x30
[  139.895283]  blk_mq_complete_request+0x1b/0x120
[  139.895292]  nvme_cancel_request+0x33/0x80 [nvme_core]
[  139.895296]  bt_tags_iter+0x48/0x50
[  139.895297]  blk_mq_tagset_busy_iter+0x1eb/0x270
[  139.895299]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
[  139.895301]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
[  139.895305]  nvme_dev_disable+0x2be/0x460 [nvme]
[  139.895307]  nvme_timeout.cold.80+0x9c/0x182 [nvme]
[  139.895311]  ? sched_clock+0x9/0x10
[  139.895315]  ? sched_clock_cpu+0x11/0xc0
[  139.895320]  ? __switch_to_asm+0x40/0x70
[  139.895321]  blk_mq_check_expired+0x192/0x1b0
[  139.895322]  bt_iter+0x52/0x60
[  139.895323]  blk_mq_queue_tag_busy_iter+0x1a0/0x2e0
[  139.895325]  ? __switch_to_asm+0x40/0x70
[  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
[  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
[  139.895329]  ? compat_start_thread+0x20/0x40
[  139.895330]  blk_mq_timeout_work+0x5a/0x130
[  139.895333]  process_one_work+0x1ab/0x380
[  139.895334]  worker_thread+0x37/0x3b0
[  139.895335]  kthread+0x120/0x140
[  139.895337]  ? create_worker+0x1b0/0x1b0
[  139.895337]  ? kthread_park+0x90/0x90
[  139.895339]  ret_from_fork+0x35/0x40
[  139.897859] nvme nvme0: Shutdown timeout set to 10 seconds
[  139.901186] nvme nvme0: 56/0/0 default/read/poll queues

On 5/20/20 4:56 AM, Ming Lei wrote:
> Hi,
>
> For nvme-pci, after controller is recovered, in-flight IOs are waited
> before updating nr hw queues. If new controller error happens during
> this period, nvme-pci driver deletes the controller and fails in-flight
> IO. This way is too violent, and not friendly from user viewpoint.
>
> Add APIs for checking if queue is frozen, and replace nvme_wait_freeze
> in nvme-pci reset handler with checking if all ns queues are frozen &
> controller disabled. Then a fresh new reset can be scheduled for
> handling new controller error during waiting for in-flight IO completion.
>
> So deleting controller & failing IOs can be avoided in this situation.
>
> Without this patches, when fail io timeout injection is run, the
> controller can be removed very quickly. With this patch, no controller
> removing can be observed, and controller can recover to normal state
> after stopping to inject io timeout failure.
>
> Ming Lei (3):
>    blk-mq: add API of blk_mq_queue_frozen
>    nvme: add nvme_frozen
>    nvme-pci: make nvme reset more reliable
>
>   block/blk-mq.c           |  6 ++++++
>   drivers/nvme/host/core.c | 14 ++++++++++++++
>   drivers/nvme/host/nvme.h |  1 +
>   drivers/nvme/host/pci.c  | 37 ++++++++++++++++++++++++++++++-------
>   include/linux/blk-mq.h   |  1 +
>   5 files changed, 52 insertions(+), 7 deletions(-)
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler
  2020-05-27 18:09 ` Alan Adamson
@ 2020-05-27 18:52   ` Keith Busch
  2020-05-28  1:36   ` Ming Lei
  1 sibling, 0 replies; 14+ messages in thread
From: Keith Busch @ 2020-05-27 18:52 UTC (permalink / raw)
  To: Alan Adamson
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig

On Wed, May 27, 2020 at 11:09:53AM -0700, Alan Adamson wrote:
> [  139.895265] CPU: 5 PID: 2470 Comm: kworker/5:1H Not tainted 5.7.0-rc7+ #1
> [  139.895266] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [  139.895271] Workqueue: kblockd blk_mq_timeout_work
> [  139.895272] Call Trace:
> [  139.895279]  dump_stack+0x6d/0x9a
> [  139.895281]  should_fail.cold.5+0x32/0x42
> [  139.895282]  blk_should_fake_timeout+0x26/0x30
> [  139.895283]  blk_mq_complete_request+0x1b/0x120
> [  139.895292]  nvme_cancel_request+0x33/0x80 [nvme_core]
> [  139.895296]  bt_tags_iter+0x48/0x50
> [  139.895297]  blk_mq_tagset_busy_iter+0x1eb/0x270
> [  139.895299]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
> [  139.895301]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
> [  139.895305]  nvme_dev_disable+0x2be/0x460 [nvme]
> [  139.895307]  nvme_timeout.cold.80+0x9c/0x182 [nvme]
> [  139.895311]  ? sched_clock+0x9/0x10
> [  139.895315]  ? sched_clock_cpu+0x11/0xc0
> [  139.895320]  ? __switch_to_asm+0x40/0x70
> [  139.895321]  blk_mq_check_expired+0x192/0x1b0
> [  139.895322]  bt_iter+0x52/0x60
> [  139.895323]  blk_mq_queue_tag_busy_iter+0x1a0/0x2e0
> [  139.895325]  ? __switch_to_asm+0x40/0x70
> [  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
> [  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
> [  139.895329]  ? compat_start_thread+0x20/0x40
> [  139.895330]  blk_mq_timeout_work+0x5a/0x130
> [  139.895333]  process_one_work+0x1ab/0x380
> [  139.895334]  worker_thread+0x37/0x3b0
> [  139.895335]  kthread+0x120/0x140
> [  139.895337]  ? create_worker+0x1b0/0x1b0
> [  139.895337]  ? kthread_park+0x90/0x90
> [  139.895339]  ret_from_fork+0x35/0x40

The driver reclaimed all outstanding tags and returned them to the block
layer. This isn't faking a timeout anymore. The driver has done its
part to reclaim lost commands. This is faking a broken block layer
instead.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler
  2020-05-27 18:09 ` Alan Adamson
  2020-05-27 18:52   ` Keith Busch
@ 2020-05-28  1:36   ` Ming Lei
  1 sibling, 0 replies; 14+ messages in thread
From: Ming Lei @ 2020-05-28  1:36 UTC (permalink / raw)
  To: Alan Adamson; +Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig

Hi Alan,

On Wed, May 27, 2020 at 11:09:53AM -0700, Alan Adamson wrote:
> I tested this patch against a timeout test I've been working with and I'm
> getting a hang.
> 
> # cat block-err.sh
> set -x
> echo 100 > /sys/kernel/debug/fail_io_timeout/probability
> echo 1000 > /sys/kernel/debug/fail_io_timeout/times
> echo 1 > /sys/block/nvme0n1/io-timeout-fail
> dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1
> 
> 
> # sh  block-err.sh
> + echo 100
> + echo 1000
> + echo 1
> + dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1
> 
> **** Hang ****
> 
> # dmesg
> .
> .
> .
> [   79.403253] FAULT_INJECTION: forcing a failure.
>                name fail_io_timeout, interval 1, probability 100, space 0,
> times 1000
> [   79.403255] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.7.0-rc7+ #1
> [   79.403256] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [   79.403257] Call Trace:
> [   79.403259]  <IRQ>
> [   79.403267]  dump_stack+0x6d/0x9a
> [   79.403270]  should_fail.cold.5+0x32/0x42
> [   79.403273]  blk_should_fake_timeout+0x26/0x30
> [   79.403275]  blk_mq_complete_request+0x1b/0x120
> [   79.403280]  nvme_irq+0xd9/0x1f0 [nvme]
> [   79.403287]  __handle_irq_event_percpu+0x44/0x190
> [   79.403288]  handle_irq_event_percpu+0x32/0x80
> [   79.403290]  handle_irq_event+0x3b/0x5a
> [   79.403291]  handle_edge_irq+0x87/0x190
> [   79.403296]  do_IRQ+0x54/0xe0
> [   79.403299]  common_interrupt+0xf/0xf
> [   79.403300]  </IRQ>
> [   79.403305] RIP: 0010:cpuidle_enter_state+0xc1/0x400
> [   79.403307] Code: ff e8 e3 41 93 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00
> 00 f6 c4 02 0f 85 d2 02 00 00 31 ff e8 16 c3 99 ff fb 66 0f 1f 44 00 00 <45>
> 85 e4 0f 88 3d 02 00 00 49 63 c4 48 8d 14 40 48 8d 0c c5 00 00
> [   79.403308] RSP: 0018:ffffb97e8c54be40 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffdd
> [   79.403309] RAX: ffff9781bf76cc40 RBX: ffffd95e7f743200 RCX:
> 000000000000001f
> [   79.403310] RDX: 000000127ccd6e6c RSI: 0000000031573862 RDI:
> 0000000000000000
> [   79.403310] RBP: ffffb97e8c54be80 R08: 0000000000000002 R09:
> 000000000002c4c0
> [   79.403311] R10: 011b921e580bc454 R11: ffff9781bf76bb44 R12:
> 0000000000000002
> [   79.403311] R13: ffffffffbd14c120 R14: ffffffffbd14c208 R15:
> ffffffffbd14c1f0
> [   79.403314]  cpuidle_enter+0x2e/0x40
> [   79.403318]  call_cpuidle+0x23/0x40
> [   79.403319]  do_idle+0x230/0x270
> [   79.403320]  cpu_startup_entry+0x1d/0x20
> [   79.403325]  start_secondary+0x170/0x1c0
> [   79.403329]  secondary_startup_64+0xb6/0xc0
> [  109.674334] nvme nvme0: I/O 754 QID 34 timeout, aborting
> [  109.674395] nvme nvme0: Abort status: 0x0
> [  139.879453] nvme nvme0: I/O 754 QID 34 timeout, reset controller
> [  139.895263] FAULT_INJECTION: forcing a failure.
>                name fail_io_timeout, interval 1, probability 100, space 0,
> times 999
> [  139.895265] CPU: 5 PID: 2470 Comm: kworker/5:1H Not tainted 5.7.0-rc7+ #1
> [  139.895266] Hardware name: Oracle Corporation ORACLE SERVER
> X6-2/ASM,MOTHERBOARD,1U, BIOS 38050100 08/30/2016
> [  139.895271] Workqueue: kblockd blk_mq_timeout_work
> [  139.895272] Call Trace:
> [  139.895279]  dump_stack+0x6d/0x9a
> [  139.895281]  should_fail.cold.5+0x32/0x42
> [  139.895282]  blk_should_fake_timeout+0x26/0x30
> [  139.895283]  blk_mq_complete_request+0x1b/0x120
> [  139.895292]  nvme_cancel_request+0x33/0x80 [nvme_core]
> [  139.895296]  bt_tags_iter+0x48/0x50
> [  139.895297]  blk_mq_tagset_busy_iter+0x1eb/0x270
> [  139.895299]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
> [  139.895301]  ? nvme_try_sched_reset+0x40/0x40 [nvme_core]
> [  139.895305]  nvme_dev_disable+0x2be/0x460 [nvme]
> [  139.895307]  nvme_timeout.cold.80+0x9c/0x182 [nvme]
> [  139.895311]  ? sched_clock+0x9/0x10
> [  139.895315]  ? sched_clock_cpu+0x11/0xc0
> [  139.895320]  ? __switch_to_asm+0x40/0x70
> [  139.895321]  blk_mq_check_expired+0x192/0x1b0
> [  139.895322]  bt_iter+0x52/0x60
> [  139.895323]  blk_mq_queue_tag_busy_iter+0x1a0/0x2e0
> [  139.895325]  ? __switch_to_asm+0x40/0x70
> [  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
> [  139.895326]  ? __blk_mq_requeue_request+0xf0/0xf0
> [  139.895329]  ? compat_start_thread+0x20/0x40
> [  139.895330]  blk_mq_timeout_work+0x5a/0x130
> [  139.895333]  process_one_work+0x1ab/0x380
> [  139.895334]  worker_thread+0x37/0x3b0
> [  139.895335]  kthread+0x120/0x140
> [  139.895337]  ? create_worker+0x1b0/0x1b0
> [  139.895337]  ? kthread_park+0x90/0x90
> [  139.895339]  ret_from_fork+0x35/0x40
> [  139.897859] nvme nvme0: Shutdown timeout set to 10 seconds
> [  139.901186] nvme nvme0: 56/0/0 default/read/poll queues

The above just shows the stack trace which fakes the timeout failure.
Not see any hang stack trace.

The reason is that you set 100% failure, then no any request can move
on. And each request is tried 100 times, which may take too long
to complete given the default timeout is 30sec.

 echo 100 > /sys/kernel/debug/fail_io_timeout/probability
 echo 1000 > /sys/kernel/debug/fail_io_timeout/times

If you stop the injection via below command after some time, such as 5min
since starting the script, you will see the test done with this patchset.

	echo 0 > /sys/block/nvme0n1/io-timeout-fail

Without this patches, the controller can be removed very soon.

Thanks,
Ming

> 
> On 5/20/20 4:56 AM, Ming Lei wrote:
> > Hi,
> > 
> > For nvme-pci, after controller is recovered, in-flight IOs are waited
> > before updating nr hw queues. If new controller error happens during
> > this period, nvme-pci driver deletes the controller and fails in-flight
> > IO. This way is too violent, and not friendly from user viewpoint.
> > 
> > Add APIs for checking if queue is frozen, and replace nvme_wait_freeze
> > in nvme-pci reset handler with checking if all ns queues are frozen &
> > controller disabled. Then a fresh new reset can be scheduled for
> > handling new controller error during waiting for in-flight IO completion.
> > 
> > So deleting controller & failing IOs can be avoided in this situation.
> > 
> > Without this patches, when fail io timeout injection is run, the
> > controller can be removed very quickly. With this patch, no controller
> > removing can be observed, and controller can recover to normal state
> > after stopping to inject io timeout failure.
> > 
> > Ming Lei (3):
> >    blk-mq: add API of blk_mq_queue_frozen
> >    nvme: add nvme_frozen
> >    nvme-pci: make nvme reset more reliable
> > 
> >   block/blk-mq.c           |  6 ++++++
> >   drivers/nvme/host/core.c | 14 ++++++++++++++
> >   drivers/nvme/host/nvme.h |  1 +
> >   drivers/nvme/host/pci.c  | 37 ++++++++++++++++++++++++++++++-------
> >   include/linux/blk-mq.h   |  1 +
> >   5 files changed, 52 insertions(+), 7 deletions(-)
> > 
> 
> _______________________________________________
> linux-nvme mailing list
> linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

-- 
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 11:56 [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
2020-05-20 11:56 ` [PATCH 1/3] blk-mq: add API of blk_mq_queue_frozen Ming Lei
2020-05-20 11:56 ` [PATCH 2/3] nvme: add nvme_frozen Ming Lei
2020-05-20 11:56 ` [PATCH 3/3] nvme-pci: make nvme reset more reliable Ming Lei
2020-05-20 17:10   ` Dongli Zhang
2020-05-20 17:27     ` Dongli Zhang
2020-05-20 17:52     ` Keith Busch
2020-05-21  2:33     ` Ming Lei
2020-05-26  5:01   ` Dongli Zhang
2020-05-26  7:12     ` Ming Lei
2020-05-26  2:55 ` [PATCH 0/3] blk-mq/nvme: improve nvme-pci reset handler Ming Lei
2020-05-27 18:09 ` Alan Adamson
2020-05-27 18:52   ` Keith Busch
2020-05-28  1:36   ` Ming Lei

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git