[PATCH] nvme: mark ctrl as DEAD if removing from error recovery

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
@ 2023-06-28  3:12 Ming Lei
  2023-06-28  6:44 ` Sagi Grimberg
  2023-06-28  7:39 ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Ming Lei @ 2023-06-28  3:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, Ming Lei, stable

namespace's request queue is frozen and quiesced during error recovering,
writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <- del_gendisk()
can't move on, and causes IO hang. Removal could be from sysfs, hard
unplug or error handling.

Fix this kind of issue by marking controller as DEAD if removal breaks
error recovery.

This ways is reasonable too, because controller can't be recovered any
more after being removed.

Cc: stable@vger.kernel.org
Reported-by: Chunguang Xu <brookxu.cn@gmail.com>
Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 4 +++-
 drivers/nvme/host/nvme.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fdfcf2781c85..b4cebc01cc00 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -567,6 +567,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	}
 
 	if (changed) {
+		ctrl->old_state = ctrl->state;
 		ctrl->state = new_state;
 		wake_up_all(&ctrl->state_wq);
 	}
@@ -4055,7 +4056,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	 * removing the namespaces' disks; fail all the queues now to avoid
 	 * potentially having to clean up the failed sync later.
 	 */
-	if (ctrl->state == NVME_CTRL_DEAD) {
+	if (ctrl->state == NVME_CTRL_DEAD ||
+			ctrl->old_state != NVME_CTRL_LIVE) {
 		nvme_mark_namespaces_dead(ctrl);
 		nvme_unquiesce_io_queues(ctrl);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a98c14c552a..ce67856d4d4f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -254,6 +254,7 @@ struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
 	enum nvme_ctrl_state state;
+	enum nvme_ctrl_state old_state;
 	spinlock_t lock;
 	struct mutex scan_lock;
 	const struct nvme_ctrl_ops *ops;
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-06-28  3:12 [PATCH] nvme: mark ctrl as DEAD if removing from error recovery Ming Lei
@ 2023-06-28  6:44 ` Sagi Grimberg
  2023-07-09  7:38   ` Sagi Grimberg
  2023-06-28  7:39 ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-06-28  6:44 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, stable


> namespace's request queue is frozen and quiesced during error recovering,
> writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <- del_gendisk()
> can't move on, and causes IO hang. Removal could be from sysfs, hard
> unplug or error handling.
> 
> Fix this kind of issue by marking controller as DEAD if removal breaks
> error recovery.
> 
> This ways is reasonable too, because controller can't be recovered any
> more after being removed.

This looks fine to me Ming,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


I still want your patches for tcp/rdma that move the freeze.
If you are not planning to send them, I swear I will :)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-06-28  3:12 [PATCH] nvme: mark ctrl as DEAD if removing from error recovery Ming Lei
  2023-06-28  6:44 ` Sagi Grimberg
@ 2023-06-28  7:39 ` Christoph Hellwig
  2023-06-28  8:10   ` Sagi Grimberg
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-06-28  7:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Yi Zhang, Chunguang Xu, stable

> @@ -4055,7 +4056,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	 * removing the namespaces' disks; fail all the queues now to avoid
>  	 * potentially having to clean up the failed sync later.
>  	 */
> -	if (ctrl->state == NVME_CTRL_DEAD) {
> +	if (ctrl->state == NVME_CTRL_DEAD ||
> +			ctrl->old_state != NVME_CTRL_LIVE) {

Style nitpick: this is really nasty to read, pleas align things
properly:

	if (ctrl->state == NVME_CTRL_DEAD ||
	    ctrl->old_state != NVME_CTRL_LIVE) {

But I'm really worried about this completely uncodumented and fragile
magic here.  The existing magic NVME_CTRL_DEAD is bad enough, but
backtracking to the previous state just makes this worse.

I think a big problem is that the call to nvme_mark_namespaces_dead
and nvme_unquiesce_io_queues is hidden inside of
nvme_remove_namespaces, and I think there's no good way around
us unwinding the currently unpair quiesce calls and fixing this
for real.
  T

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-06-28  7:39 ` Christoph Hellwig
@ 2023-06-28  8:10   ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2023-06-28  8:10 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: Keith Busch, linux-nvme, Yi Zhang, Chunguang Xu, stable


>> @@ -4055,7 +4056,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>>   	 * removing the namespaces' disks; fail all the queues now to avoid
>>   	 * potentially having to clean up the failed sync later.
>>   	 */
>> -	if (ctrl->state == NVME_CTRL_DEAD) {
>> +	if (ctrl->state == NVME_CTRL_DEAD ||
>> +			ctrl->old_state != NVME_CTRL_LIVE) {
> 
> Style nitpick: this is really nasty to read, pleas align things
> properly:
> 
> 	if (ctrl->state == NVME_CTRL_DEAD ||
> 	    ctrl->old_state != NVME_CTRL_LIVE) {
> 
> But I'm really worried about this completely uncodumented and fragile
> magic here.  The existing magic NVME_CTRL_DEAD is bad enough, but
> backtracking to the previous state just makes this worse.

Agree.

> I think a big problem is that the call to nvme_mark_namespaces_dead
> and nvme_unquiesce_io_queues is hidden inside of
> nvme_remove_namespaces, and I think there's no good way around
> us unwinding the currently unpair quiesce calls and fixing this
> for real.

My suggestion initially was to add a ctrl interface to say if the
transport can accept I/O: ctrl.ops.io_capable() and mark dead_unquiesce
based on that instead of the state. But I'm not sure it is better.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-06-28  6:44 ` Sagi Grimberg
@ 2023-07-09  7:38   ` Sagi Grimberg
  2023-07-10  3:02     ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-09  7:38 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, stable


>> namespace's request queue is frozen and quiesced during error recovering,
>> writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <- 
>> del_gendisk()
>> can't move on, and causes IO hang. Removal could be from sysfs, hard
>> unplug or error handling.
>>
>> Fix this kind of issue by marking controller as DEAD if removal breaks
>> error recovery.
>>
>> This ways is reasonable too, because controller can't be recovered any
>> more after being removed.
> 
> This looks fine to me Ming,
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> 
> 
> I still want your patches for tcp/rdma that move the freeze.
> If you are not planning to send them, I swear I will :)

Ming, can you please send the tcp/rdma patches that move the
freeze? As I said before, it addresses an existing issue with
requests unnecessarily blocked on a frozen queue instead of
failing over.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-09  7:38   ` Sagi Grimberg
@ 2023-07-10  3:02     ` Ming Lei
  2023-07-10  8:23       ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2023-07-10  3:02 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable

On Sun, Jul 09, 2023 at 10:38:29AM +0300, Sagi Grimberg wrote:
> 
> > > namespace's request queue is frozen and quiesced during error recovering,
> > > writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <-
> > > del_gendisk()
> > > can't move on, and causes IO hang. Removal could be from sysfs, hard
> > > unplug or error handling.
> > > 
> > > Fix this kind of issue by marking controller as DEAD if removal breaks
> > > error recovery.
> > > 
> > > This ways is reasonable too, because controller can't be recovered any
> > > more after being removed.
> > 
> > This looks fine to me Ming,
> > Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> > 
> > 
> > I still want your patches for tcp/rdma that move the freeze.
> > If you are not planning to send them, I swear I will :)
> 
> Ming, can you please send the tcp/rdma patches that move the
> freeze? As I said before, it addresses an existing issue with
> requests unnecessarily blocked on a frozen queue instead of
> failing over.

Any chance to fix the current issue in one easy(backportable) way[1] first?

All previous discussions on delay freeze[2] are generic, which apply on all
nvme drivers, not mention this error handling difference causes extra maintain
burden. I still suggest to convert all drivers in same way, and will work
along the approach[1] aiming for v6.6.


[1] https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc

Thanks,
Ming



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10  3:02     ` Ming Lei
@ 2023-07-10  8:23       ` Sagi Grimberg
  2023-07-10  8:57         ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-10  8:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable

>>>> namespace's request queue is frozen and quiesced during error recovering,
>>>> writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <-
>>>> del_gendisk()
>>>> can't move on, and causes IO hang. Removal could be from sysfs, hard
>>>> unplug or error handling.
>>>>
>>>> Fix this kind of issue by marking controller as DEAD if removal breaks
>>>> error recovery.
>>>>
>>>> This ways is reasonable too, because controller can't be recovered any
>>>> more after being removed.
>>>
>>> This looks fine to me Ming,
>>> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>>>
>>>
>>> I still want your patches for tcp/rdma that move the freeze.
>>> If you are not planning to send them, I swear I will :)
>>
>> Ming, can you please send the tcp/rdma patches that move the
>> freeze? As I said before, it addresses an existing issue with
>> requests unnecessarily blocked on a frozen queue instead of
>> failing over.
> 
> Any chance to fix the current issue in one easy(backportable) way[1] first?

There is, you suggested one. And I'm requesting you to send a patch for
it.

> 
> All previous discussions on delay freeze[2] are generic, which apply on all
> nvme drivers, not mention this error handling difference causes extra maintain
> burden. I still suggest to convert all drivers in same way, and will work
> along the approach[1] aiming for v6.6.

But we obviously hit a difference in expectations from different
drivers. In tcp/rdma there is currently an _existing_ bug, where
we freeze the queue on error recovery, and unfreeze only after we
reconnect. In the meantime, requests can be blocked on the frozen
request queue and not failover like they should.

In fabrics the delta between error recovery and reconnect can (and
often will be) minutes or more. Hence I request that we solve _this_
issue which is addressed by moving the freeze to the reconnect path.

I personally think that pci should do this as well, and at least
dual-ported multipath pci devices would prefer instant failover
than after a full reset cycle. But Keith disagrees and I am not going to
push for it.

Regardless of anything we do in pci, the tcp/rdma transport 
freeze-blocking-failover _must_ be addressed.

So can you please submit a patch for each? Please phrase it as what
it is, a bug fix, so stable kernels can pick it up. And try to keep
it isolated to _only_ the freeze change so that it is easily
backportable.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10  8:23       ` Sagi Grimberg
@ 2023-07-10  8:57         ` Ming Lei
  2023-07-10  9:27           ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2023-07-10  8:57 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable, ming.lei

On Mon, Jul 10, 2023 at 11:23:44AM +0300, Sagi Grimberg wrote:
> 
> > > > > namespace's request queue is frozen and quiesced during error recovering,
> > > > > writeback IO is blocked in bio_queue_enter(), so fsync_bdev() <-
> > > > > del_gendisk()
> > > > > can't move on, and causes IO hang. Removal could be from sysfs, hard
> > > > > unplug or error handling.
> > > > > 
> > > > > Fix this kind of issue by marking controller as DEAD if removal breaks
> > > > > error recovery.
> > > > > 
> > > > > This ways is reasonable too, because controller can't be recovered any
> > > > > more after being removed.
> > > > 
> > > > This looks fine to me Ming,
> > > > Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> > > > 
> > > > 
> > > > I still want your patches for tcp/rdma that move the freeze.
> > > > If you are not planning to send them, I swear I will :)
> > > 
> > > Ming, can you please send the tcp/rdma patches that move the
> > > freeze? As I said before, it addresses an existing issue with
> > > requests unnecessarily blocked on a frozen queue instead of
> > > failing over.
> > 
> > Any chance to fix the current issue in one easy(backportable) way[1] first?
> 
> There is, you suggested one. And I'm requesting you to send a patch for
> it.

The patch is the one pointed by link [1], and it still can be applied on current
linus tree.

https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/

> 
> > 
> > All previous discussions on delay freeze[2] are generic, which apply on all
> > nvme drivers, not mention this error handling difference causes extra maintain
> > burden. I still suggest to convert all drivers in same way, and will work
> > along the approach[1] aiming for v6.6.
> 
> But we obviously hit a difference in expectations from different
> drivers. In tcp/rdma there is currently an _existing_ bug, where
> we freeze the queue on error recovery, and unfreeze only after we
> reconnect. In the meantime, requests can be blocked on the frozen
> request queue and not failover like they should.
> 
> In fabrics the delta between error recovery and reconnect can (and
> often will be) minutes or more. Hence I request that we solve _this_
> issue which is addressed by moving the freeze to the reconnect path.
> 
> I personally think that pci should do this as well, and at least
> dual-ported multipath pci devices would prefer instant failover
> than after a full reset cycle. But Keith disagrees and I am not going to
> push for it.
> 
> Regardless of anything we do in pci, the tcp/rdma transport
> freeze-blocking-failover _must_ be addressed.

It is one generic issue, freeze/unfreeze has to be paired strictly
for every driver.

For any nvme driver, the inbalance can happen when error handling
is involved, that is why I suggest to fix the issue in one generic
way.

> 
> So can you please submit a patch for each? Please phrase it as what
> it is, a bug fix, so stable kernels can pick it up. And try to keep
> it isolated to _only_ the freeze change so that it is easily
> backportable.

The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.

But as we discussed, we still want to call freeze/unfreeze in pair, and
I also suggest the following approach[2], which isn't good to backport:

	1) moving freeze into reset
	
	2) during resetting
	
	- freeze NS queues
	- unquiesce NS queues
	- nvme_wait_freeze()
	- update_nr_hw_queues
	- unfreeze NS queues
	
	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
	
	- if the request is FS IO with data, re-submit all bios of this request, and free the request
	
	- otherwise, fail the request


[2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc


thanks,
Ming



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10  8:57         ` Ming Lei
@ 2023-07-10  9:27           ` Sagi Grimberg
  2023-07-10  9:50             ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-10  9:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable


>>>>> I still want your patches for tcp/rdma that move the freeze.
>>>>> If you are not planning to send them, I swear I will :)
>>>>
>>>> Ming, can you please send the tcp/rdma patches that move the
>>>> freeze? As I said before, it addresses an existing issue with
>>>> requests unnecessarily blocked on a frozen queue instead of
>>>> failing over.
>>>
>>> Any chance to fix the current issue in one easy(backportable) way[1] first?
>>
>> There is, you suggested one. And I'm requesting you to send a patch for
>> it.
> 
> The patch is the one pointed by link [1], and it still can be applied on current
> linus tree.
> 
> https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/

This is separate from what I am talking about.

>>> All previous discussions on delay freeze[2] are generic, which apply on all
>>> nvme drivers, not mention this error handling difference causes extra maintain
>>> burden. I still suggest to convert all drivers in same way, and will work
>>> along the approach[1] aiming for v6.6.
>>
>> But we obviously hit a difference in expectations from different
>> drivers. In tcp/rdma there is currently an _existing_ bug, where
>> we freeze the queue on error recovery, and unfreeze only after we
>> reconnect. In the meantime, requests can be blocked on the frozen
>> request queue and not failover like they should.
>>
>> In fabrics the delta between error recovery and reconnect can (and
>> often will be) minutes or more. Hence I request that we solve _this_
>> issue which is addressed by moving the freeze to the reconnect path.
>>
>> I personally think that pci should do this as well, and at least
>> dual-ported multipath pci devices would prefer instant failover
>> than after a full reset cycle. But Keith disagrees and I am not going to
>> push for it.
>>
>> Regardless of anything we do in pci, the tcp/rdma transport
>> freeze-blocking-failover _must_ be addressed.
> 
> It is one generic issue, freeze/unfreeze has to be paired strictly
> for every driver.
> 
> For any nvme driver, the inbalance can happen when error handling
> is involved, that is why I suggest to fix the issue in one generic
> way.

Ming, you are ignoring what I'm saying. I don't care if the
freeze/unfreeze is 100% balanced or not (for the sake of this
discussion).

I'm talking about a _separate_ issue where a queue
is frozen for potentially many minutes blocking requests that
could otherwise failover.

>> So can you please submit a patch for each? Please phrase it as what
>> it is, a bug fix, so stable kernels can pick it up. And try to keep
>> it isolated to _only_ the freeze change so that it is easily
>> backportable.
> 
> The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
> recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.

Ming, this is completely separate from what I'm talking about. This one
is addressing when the controller is removed, while I'm talking about
the error-recovery and failover, which is ages before the controller is
removed.

> But as we discussed, we still want to call freeze/unfreeze in pair, and
> I also suggest the following approach[2], which isn't good to backport:
> 
> 	1) moving freeze into reset
> 	
> 	2) during resetting
> 	
> 	- freeze NS queues
> 	- unquiesce NS queues
> 	- nvme_wait_freeze()
> 	- update_nr_hw_queues
> 	- unfreeze NS queues
> 	
> 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
> 	
> 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
> 	
> 	- otherwise, fail the request
> 
> 
> [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc

Ming, please read again what my concern is. I'm talking about error 
recovery freezing a queue, and unfreezing only after we reconnect,
blocking requests that should failover.

The controller removal when the request queue is frozen is a separate
issue, which we should address, but separately.

What I am requesting is that you send two patches, one for tcp and
one for rdma that are identical to what you did in:
[PATCH 2/2] nvme: don't freeze/unfreeze queues from different contexts

rdma.c patch:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0eb79696fb73..354cce8853c1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -918,6 +918,7 @@ static int nvme_rdma_configure_io_queues(struct 
nvme_rdma_ctrl *ctrl, bool new)
  		goto out_cleanup_tagset;

  	if (!new) {
+		nvme_start_freeze(&ctrl->ctrl);
  		nvme_unquiesce_io_queues(&ctrl->ctrl);
  		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
  			/*
@@ -926,6 +927,7 @@ static int nvme_rdma_configure_io_queues(struct 
nvme_rdma_ctrl *ctrl, bool new)
  			 * to be safe.
  			 */
  			ret = -ENODEV;
+			nvme_unfreeze(&ctrl->ctrl);
  			goto out_wait_freeze_timed_out;
  		}
  		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
@@ -975,7 +977,6 @@ static void nvme_rdma_teardown_io_queues(struct 
nvme_rdma_ctrl *ctrl,
  		bool remove)
  {
  	if (ctrl->ctrl.queue_count > 1) {
-		nvme_start_freeze(&ctrl->ctrl);
  		nvme_quiesce_io_queues(&ctrl->ctrl);
  		nvme_sync_io_queues(&ctrl->ctrl);
  		nvme_rdma_stop_io_queues(ctrl);
--

tcp.c patch:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..5ae08e9cb16d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1909,6 +1909,7 @@ static int nvme_tcp_configure_io_queues(struct 
nvme_ctrl *ctrl, bool new)
  		goto out_cleanup_connect_q;

  	if (!new) {
+		nvme_start_freeze(ctrl);
  		nvme_unquiesce_io_queues(ctrl);
  		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
  			/*
@@ -1917,6 +1918,7 @@ static int nvme_tcp_configure_io_queues(struct 
nvme_ctrl *ctrl, bool new)
  			 * to be safe.
  			 */
  			ret = -ENODEV;
+			nvme_unfreeze(ctrl);
  			goto out_wait_freeze_timed_out;
  		}
  		blk_mq_update_nr_hw_queues(ctrl->tagset,
@@ -2021,7 +2023,6 @@ static void nvme_tcp_teardown_io_queues(struct 
nvme_ctrl *ctrl,
  	if (ctrl->queue_count <= 1)
  		return;
  	nvme_quiesce_admin_queue(ctrl);
-	nvme_start_freeze(ctrl);
  	nvme_quiesce_io_queues(ctrl);
  	nvme_sync_io_queues(ctrl);
  	nvme_tcp_stop_io_queues(ctrl);
--

Nothing more.


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10  9:27           ` Sagi Grimberg
@ 2023-07-10  9:50             ` Ming Lei
  2023-07-10 11:01               ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2023-07-10  9:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable, ming.lei

On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
> 
> > > > > > I still want your patches for tcp/rdma that move the freeze.
> > > > > > If you are not planning to send them, I swear I will :)
> > > > > 
> > > > > Ming, can you please send the tcp/rdma patches that move the
> > > > > freeze? As I said before, it addresses an existing issue with
> > > > > requests unnecessarily blocked on a frozen queue instead of
> > > > > failing over.
> > > > 
> > > > Any chance to fix the current issue in one easy(backportable) way[1] first?
> > > 
> > > There is, you suggested one. And I'm requesting you to send a patch for
> > > it.
> > 
> > The patch is the one pointed by link [1], and it still can be applied on current
> > linus tree.
> > 
> > https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
> 
> This is separate from what I am talking about.
> 
> > > > All previous discussions on delay freeze[2] are generic, which apply on all
> > > > nvme drivers, not mention this error handling difference causes extra maintain
> > > > burden. I still suggest to convert all drivers in same way, and will work
> > > > along the approach[1] aiming for v6.6.
> > > 
> > > But we obviously hit a difference in expectations from different
> > > drivers. In tcp/rdma there is currently an _existing_ bug, where
> > > we freeze the queue on error recovery, and unfreeze only after we
> > > reconnect. In the meantime, requests can be blocked on the frozen
> > > request queue and not failover like they should.
> > > 
> > > In fabrics the delta between error recovery and reconnect can (and
> > > often will be) minutes or more. Hence I request that we solve _this_
> > > issue which is addressed by moving the freeze to the reconnect path.
> > > 
> > > I personally think that pci should do this as well, and at least
> > > dual-ported multipath pci devices would prefer instant failover
> > > than after a full reset cycle. But Keith disagrees and I am not going to
> > > push for it.
> > > 
> > > Regardless of anything we do in pci, the tcp/rdma transport
> > > freeze-blocking-failover _must_ be addressed.
> > 
> > It is one generic issue, freeze/unfreeze has to be paired strictly
> > for every driver.
> > 
> > For any nvme driver, the inbalance can happen when error handling
> > is involved, that is why I suggest to fix the issue in one generic
> > way.
> 
> Ming, you are ignoring what I'm saying. I don't care if the
> freeze/unfreeze is 100% balanced or not (for the sake of this
> discussion).
> 
> I'm talking about a _separate_ issue where a queue
> is frozen for potentially many minutes blocking requests that
> could otherwise failover.
> 
> > > So can you please submit a patch for each? Please phrase it as what
> > > it is, a bug fix, so stable kernels can pick it up. And try to keep
> > > it isolated to _only_ the freeze change so that it is easily
> > > backportable.
> > 
> > The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
> > recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
> 
> Ming, this is completely separate from what I'm talking about. This one
> is addressing when the controller is removed, while I'm talking about
> the error-recovery and failover, which is ages before the controller is
> removed.
> 
> > But as we discussed, we still want to call freeze/unfreeze in pair, and
> > I also suggest the following approach[2], which isn't good to backport:
> > 
> > 	1) moving freeze into reset
> > 	
> > 	2) during resetting
> > 	
> > 	- freeze NS queues
> > 	- unquiesce NS queues
> > 	- nvme_wait_freeze()
> > 	- update_nr_hw_queues
> > 	- unfreeze NS queues
> > 	
> > 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
> > 	
> > 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
> > 	
> > 	- otherwise, fail the request
> > 
> > 
> > [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
> 
> Ming, please read again what my concern is. I'm talking about error recovery
> freezing a queue, and unfreezing only after we reconnect,
> blocking requests that should failover.

From my understanding, nothing is special for tcp/rdma compared with
nvme-pci.

All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]

Queues are frozen during teardown, and unfreeze in reset or reconnect.

If the 2nd stage is failed or bypassed, queues could be left as frozen
& unquisced, and requests can't be handled, and io hang.

When tcp reconnect failed, nvme_delete_ctrl() is called for failing
requests & removing controller.

Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
can avoid this issue by calling blk_mark_disk_dead() which can fail any
request pending in bio_queue_enter().

If that isn't tcp/rdma's issue, can you explain it in details?

At least, from what you mentioned in the following link, seems it is
same with what I am trying to address.

https://lore.kernel.org/linux-nvme/b11743c1-6c58-5f7a-8dc9-2a1a065835d0@grimberg.me/T/#m5f07ee01cdc99b0b38305d8171e9085921df2bc2

Thanks, 
Ming



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10  9:50             ` Ming Lei
@ 2023-07-10 11:01               ` Sagi Grimberg
  2023-07-10 11:31                 ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-10 11:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable



On 7/10/23 12:50, Ming Lei wrote:
> On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
>>
>>>>>>> I still want your patches for tcp/rdma that move the freeze.
>>>>>>> If you are not planning to send them, I swear I will :)
>>>>>>
>>>>>> Ming, can you please send the tcp/rdma patches that move the
>>>>>> freeze? As I said before, it addresses an existing issue with
>>>>>> requests unnecessarily blocked on a frozen queue instead of
>>>>>> failing over.
>>>>>
>>>>> Any chance to fix the current issue in one easy(backportable) way[1] first?
>>>>
>>>> There is, you suggested one. And I'm requesting you to send a patch for
>>>> it.
>>>
>>> The patch is the one pointed by link [1], and it still can be applied on current
>>> linus tree.
>>>
>>> https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
>>
>> This is separate from what I am talking about.
>>
>>>>> All previous discussions on delay freeze[2] are generic, which apply on all
>>>>> nvme drivers, not mention this error handling difference causes extra maintain
>>>>> burden. I still suggest to convert all drivers in same way, and will work
>>>>> along the approach[1] aiming for v6.6.
>>>>
>>>> But we obviously hit a difference in expectations from different
>>>> drivers. In tcp/rdma there is currently an _existing_ bug, where
>>>> we freeze the queue on error recovery, and unfreeze only after we
>>>> reconnect. In the meantime, requests can be blocked on the frozen
>>>> request queue and not failover like they should.
>>>>
>>>> In fabrics the delta between error recovery and reconnect can (and
>>>> often will be) minutes or more. Hence I request that we solve _this_
>>>> issue which is addressed by moving the freeze to the reconnect path.
>>>>
>>>> I personally think that pci should do this as well, and at least
>>>> dual-ported multipath pci devices would prefer instant failover
>>>> than after a full reset cycle. But Keith disagrees and I am not going to
>>>> push for it.
>>>>
>>>> Regardless of anything we do in pci, the tcp/rdma transport
>>>> freeze-blocking-failover _must_ be addressed.
>>>
>>> It is one generic issue, freeze/unfreeze has to be paired strictly
>>> for every driver.
>>>
>>> For any nvme driver, the inbalance can happen when error handling
>>> is involved, that is why I suggest to fix the issue in one generic
>>> way.
>>
>> Ming, you are ignoring what I'm saying. I don't care if the
>> freeze/unfreeze is 100% balanced or not (for the sake of this
>> discussion).
>>
>> I'm talking about a _separate_ issue where a queue
>> is frozen for potentially many minutes blocking requests that
>> could otherwise failover.
>>
>>>> So can you please submit a patch for each? Please phrase it as what
>>>> it is, a bug fix, so stable kernels can pick it up. And try to keep
>>>> it isolated to _only_ the freeze change so that it is easily
>>>> backportable.
>>>
>>> The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
>>> recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
>>
>> Ming, this is completely separate from what I'm talking about. This one
>> is addressing when the controller is removed, while I'm talking about
>> the error-recovery and failover, which is ages before the controller is
>> removed.
>>
>>> But as we discussed, we still want to call freeze/unfreeze in pair, and
>>> I also suggest the following approach[2], which isn't good to backport:
>>>
>>> 	1) moving freeze into reset
>>> 	
>>> 	2) during resetting
>>> 	
>>> 	- freeze NS queues
>>> 	- unquiesce NS queues
>>> 	- nvme_wait_freeze()
>>> 	- update_nr_hw_queues
>>> 	- unfreeze NS queues
>>> 	
>>> 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
>>> 	
>>> 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
>>> 	
>>> 	- otherwise, fail the request
>>>
>>>
>>> [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
>>
>> Ming, please read again what my concern is. I'm talking about error recovery
>> freezing a queue, and unfreezing only after we reconnect,
>> blocking requests that should failover.
> 
>  From my understanding, nothing is special for tcp/rdma compared with
> nvme-pci.

But there is... The expectations are different.

> All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]
> 
> Queues are frozen during teardown, and unfreeze in reset or reconnect.
> 
> If the 2nd stage is failed or bypassed, queues could be left as frozen
> & unquisced, and requests can't be handled, and io hang.
> 
> When tcp reconnect failed, nvme_delete_ctrl() is called for failing
> requests & removing controller.
> 
> Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
> can avoid this issue by calling blk_mark_disk_dead() which can fail any
> request pending in bio_queue_enter().
> 
> If that isn't tcp/rdma's issue, can you explain it in details?

Yes I can. The expectation in pci is that a reset lifetime will be a few
seconds. By lifetime I mean it starts and either succeeds or fails.

in fabrics the lifetime is many minutes. i.e. it starts when error
recovery kicks in, and either succeeds (reconnected) or fails (deleted
due to ctrl_loss_tmo). This can take a long period of time (if for
example the controller is down for maintenance/reboot).

Hence, while it may be acceptable that requests are blocked on
a frozen queue for the duration of a reset in pci, it is _not_
acceptable in fabrics. I/O should failover far sooner than that
and must _not_ be dependent on the success/failure or the reset.

Hence I requested that this is addressed specifically for fabrics
(tcp/rdma).

Now, I did mention that I think that pci should behave similar to this,
but that is arguable and is only my opinion, and should be separate from
fixing the issue noted above for tcp/rdma.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10 11:01               ` Sagi Grimberg
@ 2023-07-10 11:31                 ` Ming Lei
  2023-07-10 11:56                   ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2023-07-10 11:31 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable, ming.lei

On Mon, Jul 10, 2023 at 02:01:18PM +0300, Sagi Grimberg wrote:
> 
> 
> On 7/10/23 12:50, Ming Lei wrote:
> > On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
> > > 
> > > > > > > > I still want your patches for tcp/rdma that move the freeze.
> > > > > > > > If you are not planning to send them, I swear I will :)
> > > > > > > 
> > > > > > > Ming, can you please send the tcp/rdma patches that move the
> > > > > > > freeze? As I said before, it addresses an existing issue with
> > > > > > > requests unnecessarily blocked on a frozen queue instead of
> > > > > > > failing over.
> > > > > > 
> > > > > > Any chance to fix the current issue in one easy(backportable) way[1] first?
> > > > > 
> > > > > There is, you suggested one. And I'm requesting you to send a patch for
> > > > > it.
> > > > 
> > > > The patch is the one pointed by link [1], and it still can be applied on current
> > > > linus tree.
> > > > 
> > > > https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
> > > 
> > > This is separate from what I am talking about.
> > > 
> > > > > > All previous discussions on delay freeze[2] are generic, which apply on all
> > > > > > nvme drivers, not mention this error handling difference causes extra maintain
> > > > > > burden. I still suggest to convert all drivers in same way, and will work
> > > > > > along the approach[1] aiming for v6.6.
> > > > > 
> > > > > But we obviously hit a difference in expectations from different
> > > > > drivers. In tcp/rdma there is currently an _existing_ bug, where
> > > > > we freeze the queue on error recovery, and unfreeze only after we
> > > > > reconnect. In the meantime, requests can be blocked on the frozen
> > > > > request queue and not failover like they should.
> > > > > 
> > > > > In fabrics the delta between error recovery and reconnect can (and
> > > > > often will be) minutes or more. Hence I request that we solve _this_
> > > > > issue which is addressed by moving the freeze to the reconnect path.
> > > > > 
> > > > > I personally think that pci should do this as well, and at least
> > > > > dual-ported multipath pci devices would prefer instant failover
> > > > > than after a full reset cycle. But Keith disagrees and I am not going to
> > > > > push for it.
> > > > > 
> > > > > Regardless of anything we do in pci, the tcp/rdma transport
> > > > > freeze-blocking-failover _must_ be addressed.
> > > > 
> > > > It is one generic issue, freeze/unfreeze has to be paired strictly
> > > > for every driver.
> > > > 
> > > > For any nvme driver, the inbalance can happen when error handling
> > > > is involved, that is why I suggest to fix the issue in one generic
> > > > way.
> > > 
> > > Ming, you are ignoring what I'm saying. I don't care if the
> > > freeze/unfreeze is 100% balanced or not (for the sake of this
> > > discussion).
> > > 
> > > I'm talking about a _separate_ issue where a queue
> > > is frozen for potentially many minutes blocking requests that
> > > could otherwise failover.
> > > 
> > > > > So can you please submit a patch for each? Please phrase it as what
> > > > > it is, a bug fix, so stable kernels can pick it up. And try to keep
> > > > > it isolated to _only_ the freeze change so that it is easily
> > > > > backportable.
> > > > 
> > > > The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
> > > > recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
> > > 
> > > Ming, this is completely separate from what I'm talking about. This one
> > > is addressing when the controller is removed, while I'm talking about
> > > the error-recovery and failover, which is ages before the controller is
> > > removed.
> > > 
> > > > But as we discussed, we still want to call freeze/unfreeze in pair, and
> > > > I also suggest the following approach[2], which isn't good to backport:
> > > > 
> > > > 	1) moving freeze into reset
> > > > 	
> > > > 	2) during resetting
> > > > 	
> > > > 	- freeze NS queues
> > > > 	- unquiesce NS queues
> > > > 	- nvme_wait_freeze()
> > > > 	- update_nr_hw_queues
> > > > 	- unfreeze NS queues
> > > > 	
> > > > 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
> > > > 	
> > > > 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
> > > > 	
> > > > 	- otherwise, fail the request
> > > > 
> > > > 
> > > > [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
> > > 
> > > Ming, please read again what my concern is. I'm talking about error recovery
> > > freezing a queue, and unfreezing only after we reconnect,
> > > blocking requests that should failover.
> > 
> >  From my understanding, nothing is special for tcp/rdma compared with
> > nvme-pci.
> 
> But there is... The expectations are different.
> 
> > All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]
> > 
> > Queues are frozen during teardown, and unfreeze in reset or reconnect.
> > 
> > If the 2nd stage is failed or bypassed, queues could be left as frozen
> > & unquisced, and requests can't be handled, and io hang.
> > 
> > When tcp reconnect failed, nvme_delete_ctrl() is called for failing
> > requests & removing controller.
> > 
> > Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
> > can avoid this issue by calling blk_mark_disk_dead() which can fail any
> > request pending in bio_queue_enter().
> > 
> > If that isn't tcp/rdma's issue, can you explain it in details?
> 
> Yes I can. The expectation in pci is that a reset lifetime will be a few
> seconds. By lifetime I mean it starts and either succeeds or fails.
> 
> in fabrics the lifetime is many minutes. i.e. it starts when error
> recovery kicks in, and either succeeds (reconnected) or fails (deleted
> due to ctrl_loss_tmo). This can take a long period of time (if for
> example the controller is down for maintenance/reboot).
> 
> Hence, while it may be acceptable that requests are blocked on
> a frozen queue for the duration of a reset in pci, it is _not_
> acceptable in fabrics. I/O should failover far sooner than that
> and must _not_ be dependent on the success/failure or the reset.
> 
> Hence I requested that this is addressed specifically for fabrics
> (tcp/rdma).

OK, I got your idea now, but which is basically not doable from current
nvme error recovery approach.

Even though starting freeze is moved to reconnect stage, queue is still
quiesced, then request is kept in block layer's internal queue, and can't
enter nvme fabric .queue_rq().

The patch you are pushing can't work for this requirement, also I don't
think you need that, because FS IO is actually queued to nvme mpath queue,
which won't be frozen at all.

However, you can improve nvme_ns_head_submit_bio() by selecting new path
if the underlying queue is in error recovery. Not dig into the current
code yet, I think it should have been done in this way already.

Thanks, 
Ming



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10 11:31                 ` Ming Lei
@ 2023-07-10 11:56                   ` Sagi Grimberg
  2023-07-10 13:27                     ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-10 11:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable



On 7/10/23 14:31, Ming Lei wrote:
> On Mon, Jul 10, 2023 at 02:01:18PM +0300, Sagi Grimberg wrote:
>>
>>
>> On 7/10/23 12:50, Ming Lei wrote:
>>> On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
>>>>
>>>>>>>>> I still want your patches for tcp/rdma that move the freeze.
>>>>>>>>> If you are not planning to send them, I swear I will :)
>>>>>>>>
>>>>>>>> Ming, can you please send the tcp/rdma patches that move the
>>>>>>>> freeze? As I said before, it addresses an existing issue with
>>>>>>>> requests unnecessarily blocked on a frozen queue instead of
>>>>>>>> failing over.
>>>>>>>
>>>>>>> Any chance to fix the current issue in one easy(backportable) way[1] first?
>>>>>>
>>>>>> There is, you suggested one. And I'm requesting you to send a patch for
>>>>>> it.
>>>>>
>>>>> The patch is the one pointed by link [1], and it still can be applied on current
>>>>> linus tree.
>>>>>
>>>>> https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
>>>>
>>>> This is separate from what I am talking about.
>>>>
>>>>>>> All previous discussions on delay freeze[2] are generic, which apply on all
>>>>>>> nvme drivers, not mention this error handling difference causes extra maintain
>>>>>>> burden. I still suggest to convert all drivers in same way, and will work
>>>>>>> along the approach[1] aiming for v6.6.
>>>>>>
>>>>>> But we obviously hit a difference in expectations from different
>>>>>> drivers. In tcp/rdma there is currently an _existing_ bug, where
>>>>>> we freeze the queue on error recovery, and unfreeze only after we
>>>>>> reconnect. In the meantime, requests can be blocked on the frozen
>>>>>> request queue and not failover like they should.
>>>>>>
>>>>>> In fabrics the delta between error recovery and reconnect can (and
>>>>>> often will be) minutes or more. Hence I request that we solve _this_
>>>>>> issue which is addressed by moving the freeze to the reconnect path.
>>>>>>
>>>>>> I personally think that pci should do this as well, and at least
>>>>>> dual-ported multipath pci devices would prefer instant failover
>>>>>> than after a full reset cycle. But Keith disagrees and I am not going to
>>>>>> push for it.
>>>>>>
>>>>>> Regardless of anything we do in pci, the tcp/rdma transport
>>>>>> freeze-blocking-failover _must_ be addressed.
>>>>>
>>>>> It is one generic issue, freeze/unfreeze has to be paired strictly
>>>>> for every driver.
>>>>>
>>>>> For any nvme driver, the inbalance can happen when error handling
>>>>> is involved, that is why I suggest to fix the issue in one generic
>>>>> way.
>>>>
>>>> Ming, you are ignoring what I'm saying. I don't care if the
>>>> freeze/unfreeze is 100% balanced or not (for the sake of this
>>>> discussion).
>>>>
>>>> I'm talking about a _separate_ issue where a queue
>>>> is frozen for potentially many minutes blocking requests that
>>>> could otherwise failover.
>>>>
>>>>>> So can you please submit a patch for each? Please phrase it as what
>>>>>> it is, a bug fix, so stable kernels can pick it up. And try to keep
>>>>>> it isolated to _only_ the freeze change so that it is easily
>>>>>> backportable.
>>>>>
>>>>> The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
>>>>> recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
>>>>
>>>> Ming, this is completely separate from what I'm talking about. This one
>>>> is addressing when the controller is removed, while I'm talking about
>>>> the error-recovery and failover, which is ages before the controller is
>>>> removed.
>>>>
>>>>> But as we discussed, we still want to call freeze/unfreeze in pair, and
>>>>> I also suggest the following approach[2], which isn't good to backport:
>>>>>
>>>>> 	1) moving freeze into reset
>>>>> 	
>>>>> 	2) during resetting
>>>>> 	
>>>>> 	- freeze NS queues
>>>>> 	- unquiesce NS queues
>>>>> 	- nvme_wait_freeze()
>>>>> 	- update_nr_hw_queues
>>>>> 	- unfreeze NS queues
>>>>> 	
>>>>> 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
>>>>> 	
>>>>> 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
>>>>> 	
>>>>> 	- otherwise, fail the request
>>>>>
>>>>>
>>>>> [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
>>>>
>>>> Ming, please read again what my concern is. I'm talking about error recovery
>>>> freezing a queue, and unfreezing only after we reconnect,
>>>> blocking requests that should failover.
>>>
>>>   From my understanding, nothing is special for tcp/rdma compared with
>>> nvme-pci.
>>
>> But there is... The expectations are different.
>>
>>> All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]
>>>
>>> Queues are frozen during teardown, and unfreeze in reset or reconnect.
>>>
>>> If the 2nd stage is failed or bypassed, queues could be left as frozen
>>> & unquisced, and requests can't be handled, and io hang.
>>>
>>> When tcp reconnect failed, nvme_delete_ctrl() is called for failing
>>> requests & removing controller.
>>>
>>> Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
>>> can avoid this issue by calling blk_mark_disk_dead() which can fail any
>>> request pending in bio_queue_enter().
>>>
>>> If that isn't tcp/rdma's issue, can you explain it in details?
>>
>> Yes I can. The expectation in pci is that a reset lifetime will be a few
>> seconds. By lifetime I mean it starts and either succeeds or fails.
>>
>> in fabrics the lifetime is many minutes. i.e. it starts when error
>> recovery kicks in, and either succeeds (reconnected) or fails (deleted
>> due to ctrl_loss_tmo). This can take a long period of time (if for
>> example the controller is down for maintenance/reboot).
>>
>> Hence, while it may be acceptable that requests are blocked on
>> a frozen queue for the duration of a reset in pci, it is _not_
>> acceptable in fabrics. I/O should failover far sooner than that
>> and must _not_ be dependent on the success/failure or the reset.
>>
>> Hence I requested that this is addressed specifically for fabrics
>> (tcp/rdma).
> 
> OK, I got your idea now, but which is basically not doable from current
> nvme error recovery approach.
> 
> Even though starting freeze is moved to reconnect stage, queue is still
> quiesced, then request is kept in block layer's internal queue, and can't
> enter nvme fabric .queue_rq().

See error-recovery in nvme_[tcp|rdma]_error_recovery(), the queue is
unquiesced for fast-failover.

> The patch you are pushing can't work for this requirement, also I don't
> think you need that, because FS IO is actually queued to nvme mpath queue,
> which won't be frozen at all.

But it will enter the ns request queue, and will block there. There is a
window where a *few* stray requests enter the ns request queue before
the current path is reset, those are the ones that will not failover
immediately (because they are blocked on a frozen queue).

> However, you can improve nvme_ns_head_submit_bio() by selecting new path
> if the underlying queue is in error recovery. Not dig into the current
> code yet, I think it should have been done in this way already.

That is fine, and it happens to new requests already. The nvme-mpath
failover is not broken obviously. I am only talking about the few
requests that entered the ns queue when it was frozen and before the
nshead current_path was changed.

For those requests, we should fast failover as well, and we don't today.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10 11:56                   ` Sagi Grimberg
@ 2023-07-10 13:27                     ` Ming Lei
  2023-07-10 14:00                       ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2023-07-10 13:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable, ming.lei

On Mon, Jul 10, 2023 at 02:56:08PM +0300, Sagi Grimberg wrote:
> 
> 
> On 7/10/23 14:31, Ming Lei wrote:
> > On Mon, Jul 10, 2023 at 02:01:18PM +0300, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 7/10/23 12:50, Ming Lei wrote:
> > > > On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
> > > > > 
> > > > > > > > > > I still want your patches for tcp/rdma that move the freeze.
> > > > > > > > > > If you are not planning to send them, I swear I will :)
> > > > > > > > > 
> > > > > > > > > Ming, can you please send the tcp/rdma patches that move the
> > > > > > > > > freeze? As I said before, it addresses an existing issue with
> > > > > > > > > requests unnecessarily blocked on a frozen queue instead of
> > > > > > > > > failing over.
> > > > > > > > 
> > > > > > > > Any chance to fix the current issue in one easy(backportable) way[1] first?
> > > > > > > 
> > > > > > > There is, you suggested one. And I'm requesting you to send a patch for
> > > > > > > it.
> > > > > > 
> > > > > > The patch is the one pointed by link [1], and it still can be applied on current
> > > > > > linus tree.
> > > > > > 
> > > > > > https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
> > > > > 
> > > > > This is separate from what I am talking about.
> > > > > 
> > > > > > > > All previous discussions on delay freeze[2] are generic, which apply on all
> > > > > > > > nvme drivers, not mention this error handling difference causes extra maintain
> > > > > > > > burden. I still suggest to convert all drivers in same way, and will work
> > > > > > > > along the approach[1] aiming for v6.6.
> > > > > > > 
> > > > > > > But we obviously hit a difference in expectations from different
> > > > > > > drivers. In tcp/rdma there is currently an _existing_ bug, where
> > > > > > > we freeze the queue on error recovery, and unfreeze only after we
> > > > > > > reconnect. In the meantime, requests can be blocked on the frozen
> > > > > > > request queue and not failover like they should.
> > > > > > > 
> > > > > > > In fabrics the delta between error recovery and reconnect can (and
> > > > > > > often will be) minutes or more. Hence I request that we solve _this_
> > > > > > > issue which is addressed by moving the freeze to the reconnect path.
> > > > > > > 
> > > > > > > I personally think that pci should do this as well, and at least
> > > > > > > dual-ported multipath pci devices would prefer instant failover
> > > > > > > than after a full reset cycle. But Keith disagrees and I am not going to
> > > > > > > push for it.
> > > > > > > 
> > > > > > > Regardless of anything we do in pci, the tcp/rdma transport
> > > > > > > freeze-blocking-failover _must_ be addressed.
> > > > > > 
> > > > > > It is one generic issue, freeze/unfreeze has to be paired strictly
> > > > > > for every driver.
> > > > > > 
> > > > > > For any nvme driver, the inbalance can happen when error handling
> > > > > > is involved, that is why I suggest to fix the issue in one generic
> > > > > > way.
> > > > > 
> > > > > Ming, you are ignoring what I'm saying. I don't care if the
> > > > > freeze/unfreeze is 100% balanced or not (for the sake of this
> > > > > discussion).
> > > > > 
> > > > > I'm talking about a _separate_ issue where a queue
> > > > > is frozen for potentially many minutes blocking requests that
> > > > > could otherwise failover.
> > > > > 
> > > > > > > So can you please submit a patch for each? Please phrase it as what
> > > > > > > it is, a bug fix, so stable kernels can pick it up. And try to keep
> > > > > > > it isolated to _only_ the freeze change so that it is easily
> > > > > > > backportable.
> > > > > > 
> > > > > > The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
> > > > > > recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
> > > > > 
> > > > > Ming, this is completely separate from what I'm talking about. This one
> > > > > is addressing when the controller is removed, while I'm talking about
> > > > > the error-recovery and failover, which is ages before the controller is
> > > > > removed.
> > > > > 
> > > > > > But as we discussed, we still want to call freeze/unfreeze in pair, and
> > > > > > I also suggest the following approach[2], which isn't good to backport:
> > > > > > 
> > > > > > 	1) moving freeze into reset
> > > > > > 	
> > > > > > 	2) during resetting
> > > > > > 	
> > > > > > 	- freeze NS queues
> > > > > > 	- unquiesce NS queues
> > > > > > 	- nvme_wait_freeze()
> > > > > > 	- update_nr_hw_queues
> > > > > > 	- unfreeze NS queues
> > > > > > 	
> > > > > > 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
> > > > > > 	
> > > > > > 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
> > > > > > 	
> > > > > > 	- otherwise, fail the request
> > > > > > 
> > > > > > 
> > > > > > [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
> > > > > 
> > > > > Ming, please read again what my concern is. I'm talking about error recovery
> > > > > freezing a queue, and unfreezing only after we reconnect,
> > > > > blocking requests that should failover.
> > > > 
> > > >   From my understanding, nothing is special for tcp/rdma compared with
> > > > nvme-pci.
> > > 
> > > But there is... The expectations are different.
> > > 
> > > > All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]
> > > > 
> > > > Queues are frozen during teardown, and unfreeze in reset or reconnect.
> > > > 
> > > > If the 2nd stage is failed or bypassed, queues could be left as frozen
> > > > & unquisced, and requests can't be handled, and io hang.
> > > > 
> > > > When tcp reconnect failed, nvme_delete_ctrl() is called for failing
> > > > requests & removing controller.
> > > > 
> > > > Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
> > > > can avoid this issue by calling blk_mark_disk_dead() which can fail any
> > > > request pending in bio_queue_enter().
> > > > 
> > > > If that isn't tcp/rdma's issue, can you explain it in details?
> > > 
> > > Yes I can. The expectation in pci is that a reset lifetime will be a few
> > > seconds. By lifetime I mean it starts and either succeeds or fails.
> > > 
> > > in fabrics the lifetime is many minutes. i.e. it starts when error
> > > recovery kicks in, and either succeeds (reconnected) or fails (deleted
> > > due to ctrl_loss_tmo). This can take a long period of time (if for
> > > example the controller is down for maintenance/reboot).
> > > 
> > > Hence, while it may be acceptable that requests are blocked on
> > > a frozen queue for the duration of a reset in pci, it is _not_
> > > acceptable in fabrics. I/O should failover far sooner than that
> > > and must _not_ be dependent on the success/failure or the reset.
> > > 
> > > Hence I requested that this is addressed specifically for fabrics
> > > (tcp/rdma).
> > 
> > OK, I got your idea now, but which is basically not doable from current
> > nvme error recovery approach.
> > 
> > Even though starting freeze is moved to reconnect stage, queue is still
> > quiesced, then request is kept in block layer's internal queue, and can't
> > enter nvme fabric .queue_rq().
> 
> See error-recovery in nvme_[tcp|rdma]_error_recovery(), the queue is
> unquiesced for fast-failover.

OK, sorry for missing the nvme_unquiesce_io_queues() called in
nvme_tcp_error_recovery_work().

After moving start_freeze to nvme_tcp_reconnect_ctrl_work, new request
can enter queue quickly, and all these requests may not be handled
after reconnection is done because queue topo may change. It looks not
an issue for mpath, but could be one trouble for !mpath, just like
nvme-pci. I guess you don't care !mpath?

nvme_tcp_queue_rq() highly depends on ctrl state for handling request
during error recovery, and this way is actually fragile, such as:

1) nvme_unquiesce_io_queues() has to be done after ctrl state is changed
to NVME_CTRL_CONNECTING, in nvme_tcp_error_recovery_work().

At least we should be careful for this change.

> 
> > The patch you are pushing can't work for this requirement, also I don't
> > think you need that, because FS IO is actually queued to nvme mpath queue,
> > which won't be frozen at all.
> 
> But it will enter the ns request queue, and will block there. There is a
> window where a *few* stray requests enter the ns request queue before
> the current path is reset, those are the ones that will not failover
> immediately (because they are blocked on a frozen queue).
> 
> > However, you can improve nvme_ns_head_submit_bio() by selecting new path
> > if the underlying queue is in error recovery. Not dig into the current
> > code yet, I think it should have been done in this way already.
> 
> That is fine, and it happens to new requests already. The nvme-mpath
> failover is not broken obviously. I am only talking about the few
> requests that entered the ns queue when it was frozen and before the
> nshead current_path was changed.
> 
> For those requests, we should fast failover as well, and we don't today.

OK.


Thanks, 
Ming



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] nvme: mark ctrl as DEAD if removing from error recovery
  2023-07-10 13:27                     ` Ming Lei
@ 2023-07-10 14:00                       ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2023-07-10 14:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Yi Zhang,
	Chunguang Xu, stable


>>> OK, I got your idea now, but which is basically not doable from current
>>> nvme error recovery approach.
>>>
>>> Even though starting freeze is moved to reconnect stage, queue is still
>>> quiesced, then request is kept in block layer's internal queue, and can't
>>> enter nvme fabric .queue_rq().
>>
>> See error-recovery in nvme_[tcp|rdma]_error_recovery(), the queue is
>> unquiesced for fast-failover.
> 
> OK, sorry for missing the nvme_unquiesce_io_queues() called in
> nvme_tcp_error_recovery_work().
> 
> After moving start_freeze to nvme_tcp_reconnect_ctrl_work, new request
> can enter queue quickly, and all these requests may not be handled
> after reconnection is done because queue topo may change. It looks not
> an issue for mpath, but could be one trouble for !mpath, just like
> nvme-pci. I guess you don't care !mpath?

First, !mpath is less of a concern for fabrics, although it should still
work.

In the !mpath case, if the cpu topology changes, then this needs to be
addressed specifically, and it is a secondary issue, far less important
than not failing over quickly.

> nvme_tcp_queue_rq() highly depends on ctrl state for handling request
> during error recovery, and this way is actually fragile, such as:
> 
> 1) nvme_unquiesce_io_queues() has to be done after ctrl state is changed
> to NVME_CTRL_CONNECTING, in nvme_tcp_error_recovery_work().
> 
> At least we should be careful for this change.

queue_rq() depends on ctrl->state but also queue state, and the latter
is guaranteed to be stable across quiesce/unquiesce. This is already
the case today.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-07-10 14:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-28  3:12 [PATCH] nvme: mark ctrl as DEAD if removing from error recovery Ming Lei
2023-06-28  6:44 ` Sagi Grimberg
2023-07-09  7:38   ` Sagi Grimberg
2023-07-10  3:02     ` Ming Lei
2023-07-10  8:23       ` Sagi Grimberg
2023-07-10  8:57         ` Ming Lei
2023-07-10  9:27           ` Sagi Grimberg
2023-07-10  9:50             ` Ming Lei
2023-07-10 11:01               ` Sagi Grimberg
2023-07-10 11:31                 ` Ming Lei
2023-07-10 11:56                   ` Sagi Grimberg
2023-07-10 13:27                     ` Ming Lei
2023-07-10 14:00                       ` Sagi Grimberg
2023-06-28  7:39 ` Christoph Hellwig
2023-06-28  8:10   ` Sagi Grimberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).