* [PATCHv3] nvme-mpath: delete disk after last connection
@ 2021-05-01 12:04 Hannes Reinecke
  2021-05-04  8:54 ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-01 12:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch,
	Hannes Reinecke, Daniel Wagner

From: Keith Busch <kbusch@kernel.org>

The multipath code currently deletes the disk only after all references
to it are dropped rather than when the last path to that disk is lost.
This has been reported to cause problems with some usage, like MD RAID.

Delete the disk when the last path is gone. This is the same behavior we
currently have with non-multipathed nvme devices.

The following simple example demonstrates the currently observed behavior
using an nvme loopback target (loop setup file not shown):

 # nvmetcli restore loop.json
 [   31.156452] nvmet: adding nsid 1 to subsystem testnqn1
 [   31.159140] nvmet: adding nsid 1 to subsystem testnqn2

 # nvme connect -t loop -n testnqn1 -q hostnqn
 [   36.866302] nvmet: creating controller 1 for subsystem testnqn1 for NQN hostnqn.
 [   36.872926] nvme nvme3: new ctrl: "testnqn1"

 # nvme connect -t loop -n testnqn1 -q hostnqn
 [   38.227186] nvmet: creating controller 2 for subsystem testnqn1 for NQN hostnqn.
 [   38.234450] nvme nvme4: new ctrl: "testnqn1"

 # nvme connect -t loop -n testnqn2 -q hostnqn
 [   43.902761] nvmet: creating controller 3 for subsystem testnqn2 for NQN hostnqn.
 [   43.907401] nvme nvme5: new ctrl: "testnqn2"

 # nvme connect -t loop -n testnqn2 -q hostnqn
 [   44.627689] nvmet: creating controller 4 for subsystem testnqn2 for NQN hostnqn.
 [   44.641773] nvme nvme6: new ctrl: "testnqn2"

 # mdadm --create /dev/md0 --level=mirror --raid-devices=2 /dev/nvme3n1 /dev/nvme5n1
 [   53.497038] md/raid1:md0: active with 2 out of 2 mirrors
 [   53.501717] md0: detected capacity change from 0 to 66060288

 # cat /proc/mdstat
 Personalities : [raid1]
 md0 : active raid1 nvme5n1[1] nvme3n1[0]
       64512 blocks super 1.2 [2/2] [UU]

Now delete all paths to one of the namespaces:

 # echo 1 > /sys/class/nvme/nvme3/delete_controller
 # echo 1 > /sys/class/nvme/nvme4/delete_controller

We have no path, but mdstat says:

 # cat /proc/mdstat
 Personalities : [raid1]
 md0 : active raid1 nvme5n1[1] nvme3n1[0]
       64512 blocks super 1.2 [2/2] [UU]

And this is reported to cause a problem.

With the proposed patch, the following messages appear:

 [  227.516807] md/raid1:md0: Disk failure on nvme3n1, disabling device.
 [  227.516807] md/raid1:md0: Operation continuing on 1 devices.

And mdstat shows only the viable members:

 # cat /proc/mdstat
 Personalities : [raid1]
 md0 : active (auto-read-only) raid1 nvme5n1[1]
       64512 blocks super 1.2 [2/1] [_U]

Reported-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <daniel.wagner@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c      |  2 +-
 drivers/nvme/host/multipath.c |  1 -
 drivers/nvme/host/nvme.h      | 11 ++++++++++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e6612971f4eb..6ccf0a8e7fba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -546,7 +546,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
-	nvme_mpath_remove_disk(head);
+	nvme_mpath_put_disk(head);
 	ida_simple_remove(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0d0de3433f37..d92532f518e9 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -775,7 +775,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 		 */
 		head->disk->queue = NULL;
 	}
-	put_disk(head->disk);
 }
 
 int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 05f31a2c64bb..be92c3a685c0 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -719,12 +719,18 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 
+static inline void nvme_mpath_put_disk(struct nvme_ns_head *head)
+{
+	if (head->disk)
+		put_disk(head->disk);
+}
+
 static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
 {
 	struct nvme_ns_head *head = ns->head;
 
 	if (head->disk && list_empty(&head->list))
-		kblockd_schedule_work(&head->requeue_work);
+		nvme_mpath_remove_disk(head);
 }
 
 static inline void nvme_trace_bio_complete(struct request *req)
@@ -764,6 +770,9 @@ static inline void nvme_mpath_add_disk(struct nvme_ns *ns,
 		struct nvme_id_ns *id)
 {
 }
+static inline void nvme_mpath_put_disk(struct nvme_ns_head *head)
+{
+}
 static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
 }
-- 
2.26.2


* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-01 12:04 [PATCHv3] nvme-mpath: delete disk after last connection Hannes Reinecke
@ 2021-05-04  8:54 ` Christoph Hellwig
  2021-05-04 13:40   ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-04  8:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Keith Busch, Daniel Wagner

As stated in the v3 review this is an incompatible change.  We'll need
the queue_if_no_path attribute first, and default it to on to keep
compatibility.

Also this lacks a changelog documenting what changed from v2.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-04  8:54 ` Christoph Hellwig
@ 2021-05-04 13:40   ` Hannes Reinecke
  2021-05-04 19:54     ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-04 13:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/4/21 10:54 AM, Christoph Hellwig wrote:
> As stated in the v3 review this is an incompatible change.  We'll need
> the queue_if_no_path attribute first, and default it to on to keep
> compatibility.
> 

That is what I tried the last time, but the direction I got was to treat 
both, NVMe-PCI and NVMe-oF identically:
(https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)

 >>>> And yes, this is exactly the same setup, the only difference
 >>>> being the CMIC setting for the NVMe device.
 >>>
 >>> I agree with Christoph that we should do exactly the same for all.
 >>>
 >>> Hannes, My understanding here is that you want the device to go away
 >>> after the last path disappeared because it breaks md, why don't you
 >>> want to have this also for fabrics?
 >>>
 >> Oh, I would _love_ to have it for fabrics per default.
 >> If you agree with it I can resend Keiths original patch, which solves
 >> the issue without the need for any additional settings.
 >
 > I think that this is what we should do. Care to resend this for
 > both fabrics and pci?

So what is it now?
Should I go back to the previous patch series?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-04 13:40   ` Hannes Reinecke
@ 2021-05-04 19:54     ` Sagi Grimberg
  2021-05-05 15:26       ` Keith Busch
  2021-05-06  7:43       ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Sagi Grimberg @ 2021-05-04 19:54 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Keith Busch, Daniel Wagner


>> As stated in the v3 review this is an incompatible change.  We'll need
>> the queue_if_no_path attribute first, and default it to on to keep
>> compatibility.
>>
> 
> That is what I tried the last time, but the direction I got was to treat 
> both, NVMe-PCI and NVMe-oF identically:
> (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/) 

Yes, I'm not sure I understand your comment Christoph. This addresses an
issue with mdraid where hot unplug+replug does not restore the device to
the raid group (pci and fabrics alike), where before multipath this used
to work.

queue_if_no_path is a dm-multipath feature so I'm not entirely clear
what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
way, with the introduction of nvme-mpath the behavior was broken (as far
as I understand from Hannes).

My thinking is that if we want queue_if_no_path functionality in nvme
mpath we should have it explicitly stated properly such that people
that actually need it will use it and have mdraid function correctly
again. Also, queue_if_no_path applies really if all the paths are
gone in the sense they are completely removed, and doesn't apply
to controller reset/reconnect.

I agree we should probably have queue_if_no_path attribute on the
mpath device, but it doesn't sound right to default it to true given
that it breaks mdraid stacking on top of it..

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-04 19:54     ` Sagi Grimberg
@ 2021-05-05 15:26       ` Keith Busch
  2021-05-05 16:15         ` Hannes Reinecke
  2021-05-06  7:43       ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2021-05-05 15:26 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Daniel Wagner

On Tue, May 04, 2021 at 12:54:14PM -0700, Sagi Grimberg wrote:
> 
> > > As stated in the v3 review this is an incompatible change.  We'll need
> > > the queue_if_no_path attribute first, and default it to on to keep
> > > compatibility.
> > > 
> > 
> > That is what I tried the last time, but the direction I got was to treat
> > both, NVMe-PCI and NVMe-oF identically:
> > (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)
> 
> Yes, I'm not sure I understand your comment Christoph. This addresses an
> issue with mdraid where hot unplug+replug does not restore the device to
> the raid group (pci and fabrics alike), where before multipath this used
> to work.
> 
> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
> what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
> way, with the introduction of nvme-mpath the behavior was broken (as far
> as I understand from Hannes).
> 
> My thinking is that if we want queue_if_no_path functionality in nvme
> mpath we should have it explicitly stated properly such that people
> that actually need it will use it and have mdraid function correctly
> again. Also, queue_if_no_path applies really if all the paths are
> gone in the sense they are completely removed, and doesn't apply
> to controller reset/reconnect.
> 
> I agree we should probably have queue_if_no_path attribute on the
> mpath device, but it doesn't sound right to default it to true given
> that it breaks mdraid stacking on top of it..

If you want "queue_if_no_path" behavior, can't you just set really high
reconnect_delay and ctrl_loss_tmo values? That prevents the path from
being deleted while it is unreachable, then restart IO on the existing
path once connection is re-established.
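
For example, something along these lines (transport, address and NQN are
only placeholders; a negative ctrl-loss-tmo is treated as "retry forever"):

 # nvme connect -t tcp -a 192.168.1.10 -s 4420 -n testnqn1 \
       --reconnect-delay=10 --ctrl-loss-tmo=-1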

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-05 15:26       ` Keith Busch
@ 2021-05-05 16:15         ` Hannes Reinecke
  2021-05-05 20:40           ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-05 16:15 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Daniel Wagner

On 5/5/21 5:26 PM, Keith Busch wrote:
> On Tue, May 04, 2021 at 12:54:14PM -0700, Sagi Grimberg wrote:
>>
>>>> As stated in the v3 review this is an incompatible change.  We'll need
>>>> the queue_if_no_path attribute first, and default it to on to keep
>>>> compatibility.
>>>>
>>>
>>> That is what I tried the last time, but the direction I got was to treat
>>> both, NVMe-PCI and NVMe-oF identically:
>>> (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)
>>
>> Yes, I'm not sure I understand your comment Christoph. This addresses an
>> issue with mdraid where hot unplug+replug does not restore the device to
>> the raid group (pci and fabrics alike), where before multipath this used
>> to work.
>>
>> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
>> what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
>> way, with the introduction of nvme-mpath the behavior was broken (as far
>> as I understand from Hannes).
>>
>> My thinking is that if we want queue_if_no_path functionality in nvme
>> mpath we should have it explicitly stated properly such that people
>> that actually need it will use it and have mdraid function correctly
>> again. Also, queue_if_no_path applies really if all the paths are
>> gone in the sense they are completely removed, and doesn't apply
>> to controller reset/reconnect.
>>
>> I agree we should probably have queue_if_no_path attribute on the
>> mpath device, but it doesn't sound right to default it to true given
>> that it breaks mdraid stacking on top of it..
> 
> If you want "queue_if_no_path" behavior, can't you just set really high
> reconnect_delay and ctrl_loss_tmo values? That prevents the path from
> being deleted while it is unreachable, then restart IO on the existing
> path once connection is re-established.
> 
Precisely my thinking.
We _could_ add a queue_if_no_path attribute, but we can also achieve the
same behaviour by setting the ctrl_loss_tmo value to infinity.
Provided we can change it on the fly, though; but if not, that's easily
fixed.

In fact, that's what we recommend to our customers to avoid the bug
fixed by this patch.
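
Something like the following, assuming the writable ctrl_loss_tmo sysfs
attribute (the controller name is just an example; -1 means "never give up"):

 # echo -1 > /sys/class/nvme/nvme3/ctrl_loss_tmo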

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-05 16:15         ` Hannes Reinecke
@ 2021-05-05 20:40           ` Sagi Grimberg
  2021-05-06  2:50             ` Keith Busch
  2021-05-06  6:13             ` Hannes Reinecke
  0 siblings, 2 replies; 21+ messages in thread
From: Sagi Grimberg @ 2021-05-05 20:40 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Daniel Wagner


>>>>> As stated in the v3 review this is an incompatible change.  We'll need
>>>>> the queue_if_no_path attribute first, and default it to on to keep
>>>>> compatibility.
>>>>>
>>>>
>>>> That is what I tried the last time, but the direction I got was to treat
>>>> both, NVMe-PCI and NVMe-oF identically:
>>>> (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)
>>>
>>> Yes, I'm not sure I understand your comment Christoph. This addresses an
>>> issue with mdraid where hot unplug+replug does not restore the device to
>>> the raid group (pci and fabrics alike), where before multipath this used
>>> to work.
>>>
>>> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
>>> what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
>>> way, with the introduction of nvme-mpath the behavior was broken (as far
>>> as I understand from Hannes).
>>>
>>> My thinking is that if we want queue_if_no_path functionality in nvme
>>> mpath we should have it explicitly stated properly such that people
>>> that actually need it will use it and have mdraid function correctly
>>> again. Also, queue_if_no_path applies really if all the paths are
>>> gone in the sense they are completely removed, and doesn't apply
>>> to controller reset/reconnect.
>>>
>>> I agree we should probably have queue_if_no_path attribute on the
>>> mpath device, but it doesn't sound right to default it to true given
>>> that it breaks mdraid stacking on top of it..
>>
>> If you want "queue_if_no_path" behavior, can't you just set really high
>> reconnect_delay and ctrl_loss_tmo values? That prevents the path from
>> being deleted while it is unreachable, then restart IO on the existing
>> path once connection is re-established.
>>
> Precisely my thinking.
> We _could_ add a queue_if_no_path attribute, but we can also achieve the
> same behaviour by setting the ctrl_loss_tmo value to infinity.
> Provided we can change it on the fly, though; but if not, that's easily
> fixed.
> 
> In fact, that's what we recommend to our customers to avoid the bug
> fixed by this patch.

You can change ctrl_loss_tmo on the fly. How does that address the
issue? Is the original issue that ctrl_loss_tmo expires for fabrics, or
a pci unplug (to which ctrl_loss_tmo does not apply)?

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-05 20:40           ` Sagi Grimberg
@ 2021-05-06  2:50             ` Keith Busch
  2021-05-06  6:13             ` Hannes Reinecke
  1 sibling, 0 replies; 21+ messages in thread
From: Keith Busch @ 2021-05-06  2:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Daniel Wagner

On Wed, May 05, 2021 at 01:40:29PM -0700, Sagi Grimberg wrote:
> 
> > > > > > As stated in the v3 review this is an incompatible change.  We'll need
> > > > > > the queue_if_no_path attribute first, and default it to on to keep
> > > > > > compatibility.
> > > > > > 
> > > > > 
> > > > > That is what I tried the last time, but the direction I got was to treat
> > > > > both, NVMe-PCI and NVMe-oF identically:
> > > > > (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)
> > > > 
> > > > Yes, I'm not sure I understand your comment Christoph. This addresses an
> > > > issue with mdraid where hot unplug+replug does not restore the device to
> > > > the raid group (pci and fabrics alike), where before multipath this used
> > > > to work.
> > > > 
> > > > queue_if_no_path is a dm-multipath feature so I'm not entirely clear
> > > > what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
> > > > way, with the introduction of nvme-mpath the behavior was broken (as far
> > > > as I understand from Hannes).
> > > > 
> > > > My thinking is that if we want queue_if_no_path functionality in nvme
> > > > mpath we should have it explicitly stated properly such that people
> > > > that actually need it will use it and have mdraid function correctly
> > > > again. Also, queue_if_no_path applies really if all the paths are
> > > > gone in the sense they are completely removed, and doesn't apply
> > > > to controller reset/reconnect.
> > > > 
> > > > I agree we should probably have queue_if_no_path attribute on the
> > > > mpath device, but it doesn't sound right to default it to true given
> > > > that it breaks mdraid stacking on top of it..
> > > 
> > > If you want "queue_if_no_path" behavior, can't you just set really high
> > > reconnect_delay and ctrl_loss_tmo values? That prevents the path from
> > > being deleted while it is unreachable, then restart IO on the existing
> > > path once connection is re-established.
> > > 
> > Precisely my thinking.
> > We _could_ add a queue_if_no_path attribute, but we can also achieve the
> > same behaviour by setting the ctrl_loss_tmo value to infinity.
> > Provided we can change it on the fly, though; but if not, that's easily
> > fixed.
> > 
> > In fact, that's what we recommend to our customers to avoid the bug
> > fixed by this patch.
> 
> You can change ctrl_loss_tmo on the fly. How does that address the
> issue? Is the original issue that ctrl_loss_tmo expires for fabrics, or
> a pci unplug (to which ctrl_loss_tmo does not apply)?

PCI aside, the ctrl_loss_tmo parameter can be used to have IO queued on
the head's bio_list whilst no path is available. That's what
"queue_if_no_path" would provide too, right?

The main difference AFAICS is that "ctrl_loss_tmo" applies to individual
paths, whereas "queue_if_no_path" would apply to the "head", but I'm not
sure that detail matters.

Regarding PCI, we could add the property there, but disconnecting a
local PCIe device does seem like a different scenario than losing
connection to a remote target.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-05 20:40           ` Sagi Grimberg
  2021-05-06  2:50             ` Keith Busch
@ 2021-05-06  6:13             ` Hannes Reinecke
  1 sibling, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-06  6:13 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, Daniel Wagner

On 5/5/21 10:40 PM, Sagi Grimberg wrote:
> 
>>>>>> As stated in the v3 review this is an incompatible change.  We'll 
>>>>>> need
>>>>>> the queue_if_no_path attribute first, and default it to on to keep
>>>>>> compatibility.
>>>>>>
>>>>>
>>>>> That is what I tried the last time, but the direction I got was to 
>>>>> treat
>>>>> both, NVMe-PCI and NVMe-oF identically:
>>>>> (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/) 
>>>>>
>>>>
>>>> Yes, I'm not sure I understand your comment Christoph. This 
>>>> addresses an
>>>> issue with mdraid where hot unplug+replug does not restore the 
>>>> device to
>>>> the raid group (pci and fabrics alike), where before multipath this 
>>>> used
>>>> to work.
>>>>
>>>> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
>>>> what is the concern? mdraid on nvme (pci/fabrics) used to work a 
>>>> certain
>>>> way, with the introduction of nvme-mpath the behavior was broken (as 
>>>> far
>>>> as I understand from Hannes).
>>>>
>>>> My thinking is that if we want queue_if_no_path functionality in nvme
>>>> mpath we should have it explicitly stated properly such that people
>>>> that actually need it will use it and have mdraid function correctly
>>>> again. Also, queue_if_no_path applies really if all the paths are
>>>> gone in the sense they are completely removed, and doesn't apply
>>>> to controller reset/reconnect.
>>>>
>>>> I agree we should probably have queue_if_no_path attribute on the
>>>> mpath device, but it doesn't sound right to default it to true given
>>>> that it breaks mdraid stacking on top of it..
>>>
>>> If you want "queue_if_no_path" behavior, can't you just set really high
>>> reconnect_delay and ctrl_loss_tmo values? That prevents the path from
>>> being deleted while it is unreachable, then restart IO on the existing
>>> path once connection is re-established.
>>>
>> Precisely my thinking.
>> We _could_ add a queue_if_no_path attribute, but we can also achieve the
>> same behaviour by setting the ctrl_loss_tmo value to infinity.
>> Provided we can change it on the fly, though; but if not, that's easily
>> fixed.
>>
>> In fact, that's what we recommend to our customers to avoid the bug
>> fixed by this patch.
> 
> You can change ctrl_loss_tmo on the fly. How does that address the
> issue? Is the original issue that ctrl_loss_tmo expires for fabrics, or
> a pci unplug (to which ctrl_loss_tmo does not apply)?

Yes. It becomes particularly noticeable in TCP fabrics where the link 
can go down for an extended time.
The system will try to reconnect until ctrl_loss_tmo kicks in; if the 
link gets reestablished after that time your system is hosed.

With this patch I/O is still killed, but at least you can then 
re-establish the connection by just calling

nvme connect

and the nvme device will be reconnected such that you can call

mdadm --re-add

to resync the device.
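
I.e. something like this, with purely illustrative address and device names:

 # nvme connect -t tcp -a 192.168.1.10 -s 4420 -n testnqn1
 # mdadm /dev/md0 --re-add /dev/nvme3n1
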
With the current implementation you are out of luck as I/O is pending on 
the disconnected original nvme device, and you have no chance to flush 
that. Consequently you can't detach it from the MD array, and, again, 
your system is hosed.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-04 19:54     ` Sagi Grimberg
  2021-05-05 15:26       ` Keith Busch
@ 2021-05-06  7:43       ` Christoph Hellwig
  2021-05-06  8:42         ` Hannes Reinecke
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-06  7:43 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Keith Busch, Daniel Wagner

On Tue, May 04, 2021 at 12:54:14PM -0700, Sagi Grimberg wrote:
> Yes, I'm not sure I understand your comment Christoph. This addresses an
> issue with mdraid where hot unplug+replug does not restore the device to
> the raid group (pci and fabrics alike), where before multipath this used
> to work.
>
> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
> what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
> way, with the introduction of nvme-mpath the behavior was broken (as far
> as I understand from Hannes).

AFAIK that specific mdraid behavior is also fixed by the uevent patch
he sent.

> My thinking is that if we want queue_if_no_path functionality in nvme
> mpath we should have it explicitly stated properly such that people
> that actually need it will use it and have mdraid function correctly
> again. Also, queue_if_no_path applies really if all the paths are
> gone in the sense they are completely removed, and doesn't apply
> to controller reset/reconnect.
>
> I agree we should probably have queue_if_no_path attribute on the
> mpath device, but it doesn't sound right to default it to true given
> that it breaks mdraid stacking on top of it..

I really do not think we should change the mpath behaviors years after
first adding it.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-06  7:43       ` Christoph Hellwig
@ 2021-05-06  8:42         ` Hannes Reinecke
  2021-05-06  9:47           ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-06  8:42 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/6/21 9:43 AM, Christoph Hellwig wrote:
> On Tue, May 04, 2021 at 12:54:14PM -0700, Sagi Grimberg wrote:
>> Yes, I'm not sure I understand your comment Christoph. This addresses an
>> issue with mdraid where hot unplug+replug does not restore the device to
>> the raid group (pci and fabrics alike), where before multipath this used
>> to work.
>>
>> queue_if_no_path is a dm-multipath feature so I'm not entirely clear
>> what is the concern? mdraid on nvme (pci/fabrics) used to work a certain
>> way, with the introduction of nvme-mpath the behavior was broken (as far
>> as I understand from Hannes).
> 
> AFAIK that specific mdraid behavior is also fixed by the uevent patch
> he sent.
> 
It is most emphatically _NOT_.

These two patches are complementary.

To rephrase: with the current behaviour MD is completely hosed once one
NVMe-oF device gets removed after ctrl_loss_tmo kicks in.

And _nothing_ will fix that except a system reboot.

_That_ is the issue this patch fixes.

The other patch for sending the uevent is just to tell MD that recovery
can start. But recovery _cannot_ start without this patch.

> 
> I really do not think we should change the mpath behaviors years after
> first adding it.
> 

But only because no-one ever tested MD on nvme-multipath.
It has been broken since day 1.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-06  8:42         ` Hannes Reinecke
@ 2021-05-06  9:47           ` Sagi Grimberg
  2021-05-06 12:08             ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2021-05-06  9:47 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Keith Busch, Daniel Wagner


>> I really do not think we should change the mpath behaviors years after
>> first adding it.
>>
> 
> But only because no-one ever tested MD on nvme-multipath.
> It has been broken since day 1.

That is my assumption as well, and it's a problem that, by default, when
a pci device unplugs or a fabrics controller is lost and removed,
the stacked mdraid on top stops functioning.

I still think we can fix that by changing the semantics of implicit
queue_if_no_path and restoring it as an explicit opt-in.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-06  9:47           ` Sagi Grimberg
@ 2021-05-06 12:08             ` Christoph Hellwig
  2021-05-06 15:54               ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-06 12:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	Keith Busch, Daniel Wagner

On Thu, May 06, 2021 at 02:47:00AM -0700, Sagi Grimberg wrote:
> That is my assumption as well, and it's a problem that, by default, when
> a pci device unplugs or a fabrics controller is lost and removed,
> the stacked mdraid on top stops functioning.
>
> I still think we can fix that by changing the semantics of implicit
> queue_if_no_path and restoring it as an explicit opt-in.

So at the very least we need to make sure we have the opt-in available before
changing the defaults.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-06 12:08             ` Christoph Hellwig
@ 2021-05-06 15:54               ` Hannes Reinecke
  2021-05-07  6:46                 ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-06 15:54 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/6/21 2:08 PM, Christoph Hellwig wrote:
> On Thu, May 06, 2021 at 02:47:00AM -0700, Sagi Grimberg wrote:
>> That is my assumption as well, and it's a problem that, by default, when
>> a pci device unplugs or a fabrics controller is lost and removed,
>> the stacked mdraid on top stops functioning.
>>
>> I still think we can fix that by changing the semantics of implicit
>> queue_if_no_path and restoring it as an explicit opt-in.
> 
> So at the very least we need to make sure we have the opt-in available before
> changing the defaults.
> 

But what _is_ the default?
PCI and fabrics have different defaults; for PCI the device goes away if 
the last path (ie the controller) goes away, for fabrics it doesn't if 
the device is mounted.

Do we want to keep the difference in behaviour?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-06 15:54               ` Hannes Reinecke
@ 2021-05-07  6:46                 ` Christoph Hellwig
  2021-05-07 17:02                   ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-07  6:46 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Keith Busch, Daniel Wagner

On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
> PCI and fabrics have different defaults; for PCI the device goes away if 
> the last path (ie the controller) goes away, for fabrics it doesn't if the 
> device is mounted.

Err, no.  For fabrics we reconnect a while, but otherwise the behavior
is the same right now.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-07  6:46                 ` Christoph Hellwig
@ 2021-05-07 17:02                   ` Hannes Reinecke
  2021-05-07 17:20                     ` Sagi Grimberg
  2021-05-10  6:23                     ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-07 17:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/7/21 8:46 AM, Christoph Hellwig wrote:
> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>> PCI and fabrics have different defaults; for PCI the device goes away if
>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>> device is mounted.
> 
> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
> is the same right now.
> 
No, that is not the case.

When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), 
the nvme device is completely removed, irrespective of whether it's
mounted or not.
When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie 
the nsnhead) will _stay_ when mounted (as the refcount is not zero).

This can be easily demonstrated on qemu; just set the 'subsys' parameter 
for the nvme device.
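
For reference, a qemu configuration along these lines (ids, serial and drive
names are just examples; linking the controller to an nvme-subsys device is
what makes CMIC non-zero here):

 -device pcie-root-port,id=rp90,bus=pcie.0,chassis=90
 -drive file=nvme-3.img,if=none,format=raw,id=nvme-3
 -device nvme-subsys,id=subsys3,nqn=subsys3
 -device nvme,id=nvme-rp90,bus=rp90,serial=SLESNVME3,subsys=subsys3
 -device nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1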

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-07 17:02                   ` Hannes Reinecke
@ 2021-05-07 17:20                     ` Sagi Grimberg
  2021-05-10  6:23                     ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2021-05-07 17:20 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Keith Busch, Daniel Wagner


>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>> the last path (ie the controller) goes away, for fabrics it doesn't 
>>> if the
>>> device is mounted.
>>
>> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
>> is the same right now.
>>
> No, that is not the case.
> 
> When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), 
> the nvme device is completely removed, irrespective of whether it's
> mounted or not.
> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie 
> the nsnhead) will _stay_ when mounted (as the refcount is not zero).
> 
> This can be easily demonstrated on qemu; just set the 'subsys' parameter 
> for the nvme device.

Perhaps we need to fix the issue and allow the existing behavior by
having an explicit queue_if_no_path argument.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-07 17:02                   ` Hannes Reinecke
  2021-05-07 17:20                     ` Sagi Grimberg
@ 2021-05-10  6:23                     ` Christoph Hellwig
  2021-05-10 13:01                       ` Hannes Reinecke
  2021-05-10 14:48                       ` Hannes Reinecke
  1 sibling, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2021-05-10  6:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Keith Busch, Daniel Wagner

On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>> device is mounted.
>>
>> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
>> is the same right now.
>>
> No, that is not the case.
>
> When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), the 
>> nvme device is completely removed, irrespective of whether it's mounted or
> not.
> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the 
> nsnhead) will _stay_ when mounted (as the refcount is not zero).

Yes.  But that has nothing to do with fabrics as you claimed above, but
with the fact if the subsystem supports multiple controller (and thus
shared namespaces) or not.

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-10  6:23                     ` Christoph Hellwig
@ 2021-05-10 13:01                       ` Hannes Reinecke
  2021-05-10 13:57                         ` Hannes Reinecke
  2021-05-10 14:48                       ` Hannes Reinecke
  1 sibling, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-10 13:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/10/21 8:23 AM, Christoph Hellwig wrote:
> On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
>> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>>> device is mounted.
>>>
>>> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
>>> is the same right now.
>>>
>> No, that is not the case.
>>
>> When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), the 
>> nvme device is completely removed, irrespective of whether it's mounted or
>> not.
>> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the 
>> nsnhead) will _stay_ when mounted (as the refcount is not zero).
> 
> Yes.  But that has nothing to do with fabrics as you claimed above, but
> with the fact if the subsystem supports multiple controller (and thus
> shared namespaces) or not.
> 
It's still broken, though.

I've setup a testbed to demonstrate what I mean.

I have created a qemu instance with 3 NVMe devices, one for booting and
two for MD RAID.
After boot, MD RAID says this:

 # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk


Now I detach the PCI device for /dev/nvme3:

(qemu) device_del nvme-rp90
(qemu) [  183.512585] pcieport 0000:00:09.0: pciehp: Slot(0-2):
Attention button pressed
[  183.515462] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering off
due to button press

And validate that the device is gone:

# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:08.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:09.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6
port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller
(rev 02)
01:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)
02:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)

Checking MD I still get:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

I.e. MD hasn't registered _anything_, even though the device is physically
not present anymore.
And to make matters worse, 'nvme list' says:

# nvme list
Node             SN                   Model
       Namespace Usage                      Format           FW Rev
---------------- --------------------
---------------------------------------- ---------
-------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl
       1          17.18  GB /  17.18  GB    512   B +  0 B   1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl
       1           4.29  GB /   4.29  GB    512   B +  0 B   1.0
/dev/nvme2n1     |U                   b��||U
       -1          0.00   B /   0.00   B      1   B +  0 B   ���||U

which arguably is a bug in itself, as we shouldn't display weird strings
here. But no matter.

Now I'm reattaching the PCI device:

(qemu) device_add nvme,bus=rp90,id=nvme-rp90,subsys=subsys3
[   49.261163] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention
button pressed
[   49.263915] pcieport 0000:00:09.0: pciehp: Slot(0-2) Powering on due
to button press
[   49.267188] pcieport 0000:00:09.0: pciehp: Slot(0-2): Card present
[   49.269505] pcieport 0000:00:09.0: pciehp: Slot(0-2): Link Up
[   49.406035] pci 0000:03:00.0: [1b36:0010] type 00 class 0x010802
[   49.411585] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[   49.417627] pci 0000:03:00.0: BAR 0: assigned [mem
0xc1000000-0xc1003fff 64bit]
[   49.421057] pcieport 0000:00:09.0: PCI bridge to [bus 03]
[   49.424071] pcieport 0000:00:09.0:   bridge window [io  0x6000-0x6fff]
[   49.428157] pcieport 0000:00:09.0:   bridge window [mem
0xc1000000-0xc11fffff]
[   49.431379] pcieport 0000:00:09.0:   bridge window [mem
0x804000000-0x805ffffff 64bit pref]
[   49.436591] nvme nvme3: pci function 0000:03:00.0
[   49.438303] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[   49.446746] nvme nvme3: 1/0/0 default/read/poll queues
(qemu) device_add nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1
[   64.781295] nvme nvme3: rescanning namespaces.
[   64.806720] block nvme2n1: no available path - failing I/O

And I'm ending up with _4_ namespaces:
# nvme list
Node             SN                   Model
       Namespace Usage                      Format           FW Rev
---------------- --------------------
---------------------------------------- ---------
-------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl
       1          17.18  GB /  17.18  GB    512   B +  0 B   1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl
       1           4.29  GB /   4.29  GB    512   B +  0 B   1.0
/dev/nvme2n1     SLESNVME3            QEMU NVMe Ctrl
       -1          0.00   B /   0.00   B      1   B +  0 B   1.0
/dev/nvme2n2     SLESNVME3            QEMU NVMe Ctrl
       1           4.29  GB /   4.29  GB    512   B +  0 B   1.0

and MD is still referencing the original one:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

when doing I/O MD will finally figure out that something is amiss:

[  152.636007] block nvme2n1: no available path - failing I/O
[  152.641562] block nvme2n1: no available path - failing I/O
[  152.645454] md: super_written gets error=-5
[  152.648799] md/raid1:md1: Disk failure on nvme2n1, disabling device.
[  152.648799] md/raid1:md1: Operation continuing on 1 devices.

but we're left with the problem that the re-attached namespace now has a
different device name (/dev/nvme2n2 vs /dev/nvme2n1), so MD cannot
reattach the device seamlessly and needs manual intervention.

Note: this has been tested with latest nvme-5.13, and the situation has
improved somewhat (as compared to previous versions) by the fact that MD
is now able to recover after manual interaction.
But we still end up with a namespace with the wrong name.

Cheers,

Hannes

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-10 13:01                       ` Hannes Reinecke
@ 2021-05-10 13:57                         ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-10 13:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/10/21 3:01 PM, Hannes Reinecke wrote:
[ .. ]
> Note: this has been tested with latest nvme-5.13, and the situation has
> improved somewhat (as compared to previous versions) by the fact that MD
> is now able to recover after manual interaction.
> But we still end up with a namespace with the wrong name.
> 

... and it's actually _not_ what I was planning to demonstrate.
Latest qemu emulates NVMe with CMIC=2, so we never actually engage
multipathing.

So scratch that; I'll need to redo the test.

Sigh.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

* Re: [PATCHv3] nvme-mpath: delete disk after last connection
  2021-05-10  6:23                     ` Christoph Hellwig
  2021-05-10 13:01                       ` Hannes Reinecke
@ 2021-05-10 14:48                       ` Hannes Reinecke
  1 sibling, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-05-10 14:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Keith Busch, Daniel Wagner

On 5/10/21 8:23 AM, Christoph Hellwig wrote:
> On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
>> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>>> device is mounted.
>>>
>>> Err, no.  For fabrics we reconnect a while, but otherwise the behavior
>>> is the same right now.
>>>
>> No, that is not the case.
>>
>> When a PCI nvme device with CMIC=0 is removed (via pci hotplug, say), the 
>> nvme device is completely removed, irrespective of whether it's mounted or
>> not.
>> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the 
>> nsnhead) will _stay_ when mounted (as the refcount is not zero).
> 
> Yes.  But that has nothing to do with fabrics as you claimed above, but
> with the fact if the subsystem supports multiple controller (and thus
> shared namespaces) or not.
> 
So, I seem to have reproduced the issue with latest nvme-5.13.
The failure pattern is slightly different, so I think I've been able to
solve it in a slightly less controversial manner.

Patch to follow.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
