* [PATCH] nvme: avoid hang on inaccessible paths
@ 2018-05-30 11:16 Hannes Reinecke
  2018-05-30 12:12 ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2018-05-30 11:16 UTC (permalink / raw)


If all discovered paths are inaccessible we need to terminate the I/O,
otherwise we'll hang in revalidate_disk() and the connect call will
never succeed.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/nvme/host/multipath.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 74bb5b737879..fa3a4efe288e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -150,7 +150,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 	ns = __nvme_find_path(head, NVME_ANA_OPTIMIZED);
 	if (!ns)
 		ns = __nvme_find_path(head, NVME_ANA_NONOPTIMIZED);
-	/* XXX: try an inaccessible path as last resort per 8.18.3.3 */
+	if (!ns)
+		ns = __nvme_find_path(head, NVME_ANA_INACCESSIBLE);
 	return ns;
 }
 
@@ -165,10 +166,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
 
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
-	if (likely(ns)) {
+	if (likely(ns && nvme_ns_ana_state(ns) != NVME_ANA_INACCESSIBLE)) {
 		bio->bi_disk = ns->disk;
 		bio->bi_opf |= REQ_NVME_MPATH;
 		ret = direct_make_request(bio);
+	} else if (ns) {
+		bio->bi_status = BLK_STS_TRANSPORT;
+		bio_set_flag(bio, BIO_QUIET);
+		bio_endio(bio);
 	} else if (!list_empty_careful(&head->list)) {
 		dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
 
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 11:16 [PATCH] nvme: avoid hang on inaccessible paths Hannes Reinecke
@ 2018-05-30 12:12 ` Christoph Hellwig
  2018-05-30 12:30   ` Hannes Reinecke
  2018-06-06 19:02   ` Popuri, Sriram
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2018-05-30 12:12 UTC (permalink / raw)


> -	/* XXX: try an inaccessible path as last resort per 8.18.3.3 */
> +	if (!ns)
> +		ns = __nvme_find_path(head, NVME_ANA_INACCESSIBLE);

This at least needs to keep a comment on why we are aiming for an
inaccessible path.

> @@ -165,10 +166,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
>  
>  	srcu_idx = srcu_read_lock(&head->srcu);
>  	ns = nvme_find_path(head);
> -	if (likely(ns)) {
> +	if (likely(ns && nvme_ns_ana_state(ns) != NVME_ANA_INACCESSIBLE)) {
>  		bio->bi_disk = ns->disk;
>  		bio->bi_opf |= REQ_NVME_MPATH;
>  		ret = direct_make_request(bio);

We should actually sometimes try to issue I/O on an inaccessible path.
Please take a look at Section 8.18.3.3 in the ANA TP.  Similar handling
would also apply to change states, which we'd also need to cover here
and above.


> +	} else if (ns) {
> +		bio->bi_status = BLK_STS_TRANSPORT;
> +		bio_set_flag(bio, BIO_QUIET);
> +		bio_endio(bio);
>  	} else if (!list_empty_careful(&head->list)) {
>  		dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");

But a case where we only have inaccessible / change states isn't
really different from the no path available, requeuing case.

Now it might make sense to have a (configurable) timeout to give
up in all those cases, and if my vague memory serves me right you
actually volunteered to implement that a while ago.


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 12:12 ` Christoph Hellwig
@ 2018-05-30 12:30   ` Hannes Reinecke
  2018-05-30 12:54     ` Christoph Hellwig
  2018-06-06 19:02   ` Popuri, Sriram
  1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2018-05-30 12:30 UTC (permalink / raw)


On Wed, 30 May 2018 14:12:41 +0200
Christoph Hellwig <hch@lst.de> wrote:

> > -	/* XXX: try an inaccessible path as last resort per 8.18.3.3 */
> > +	if (!ns)
> > +		ns = __nvme_find_path(head, NVME_ANA_INACCESSIBLE);
> 
> This at least needs to keep a comment on why we are aiming for an
> inaccessible path.
> 
Ok.

> > @@ -165,10 +166,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
> >  
> >  	srcu_idx = srcu_read_lock(&head->srcu);
> >  	ns = nvme_find_path(head);
> > -	if (likely(ns)) {
> > +	if (likely(ns && nvme_ns_ana_state(ns) != NVME_ANA_INACCESSIBLE)) {
> >  		bio->bi_disk = ns->disk;
> >  		bio->bi_opf |= REQ_NVME_MPATH;
> >  		ret = direct_make_request(bio);
> 
> We should actually sometimes try to issue I/O on an inaccessible path.
> Please take a look at Section 8.18.3.3 in the ANA TP.  Similar
> handling would also apply to change states, which we'd also need to
> cover here and above.
> 
That specific handling hasn't been implemented with this patch, as it
would induce issues on the host side.
But see below.

> 
> > +	} else if (ns) {
> > +		bio->bi_status = BLK_STS_TRANSPORT;
> > +		bio_set_flag(bio, BIO_QUIET);
> > +		bio_endio(bio);
> >  	} else if (!list_empty_careful(&head->list)) {
> >  		dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
> 
> But a case where we only have inaccessible / change states isn't
> really different from the no path available, requeuing case.
> 
It is for the initial connect.

> Now it might make sense to have a (configurable) timeout to give
> up in all those cases, and if my vague memory serves me right you
> actually volunteered to implement that a while ago.
> 
The problem is that this particular code path is triggered for the
revalidate_disk() case; when opting for requeue (as the original code
did) the system will hang during revalidate_disk(), via

[ 1463.219509]  schedule+0x4f/0xd0
[ 1463.220712]  io_schedule+0x1c/0x50
[ 1463.221415]  do_read_cache_page+0x603/0x8a0
[ 1463.232007]  read_dev_sector+0x5b/0x160
[ 1463.232858]  read_lba+0x1b5/0x340
[ 1463.237400]  efi_partition+0x234/0xce0
[ 1463.254255]  check_partition+0x1ab/0x310
[ 1463.255090]  rescan_partitions+0x136/0x4b0
[ 1463.258420]  __blkdev_get+0x535/0x840
[ 1463.261793]  blkdev_get+0x1ff/0x4c0
[ 1463.267360]  __device_add_disk+0x73c/0x7a0
[ 1463.272179]  nvme_mpath_add_disk+0x9a/0xe0
[ 1463.273042]  nvme_alloc_ns+0x6e3/0xdb0

As this blocks the nvmf_dev_mutex we don't have any chance to connect
the other, optimized path.

I had been thinking of implementing that particular handling from the
ANA spec, but that would mean we're adding an ANA TT delay for each
inaccessible path, which with the current defaults would mean booting
is delayed by 10 seconds per inaccessible path.
That simply doesn't scale, so I opted for aborting the I/O straight away.
(Which, incidentally, is also how we handle things on the dm-multipath
side).

One possible alternative would be to make revalidate_disk non-blocking
by adding another GENHD flag to avoid reading partitions, but as this
would require more plumbing I haven't attempted it so far.

Cheers,

Hannes


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 12:30   ` Hannes Reinecke
@ 2018-05-30 12:54     ` Christoph Hellwig
  2018-05-30 23:10       ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2018-05-30 12:54 UTC (permalink / raw)


On Wed, May 30, 2018 at 02:30:20PM +0200, Hannes Reinecke wrote:
> > Now it might make sense to have a (configurable) timeout to give
> > up in all those case, and if my vague memory serves me right you
> > actually volunteered to implement that a while ago.
> > 
> The problem is that this particular code path is triggered for the
> revalidate_disk() case; when opting for requeue (as the original code
> did) the system will hang during revalidate_disk(), via

So we add the first path, which is inaccessible.  So yes, the I/O
should block for now (that is, until we have a timeout for that).

> As this blocks the nvmf_dev_mutex we don't have any chance to connect
> the other, optimized path.

And I guess this is where the real issue starts.  We should not do block
I/O under nvmf_dev_mutex.

James has some work on asynchronous connects pending for FC, so I guess
we could look into generalizing that and always execute the real connect
work in a different thread.

Or we could move create_ctrl outside the lock, which Johannes had a
patch for a few days ago, although that didn't work as-is.  I'll attach
my completely untested attempt at that below.

> I had been thinking of implementing that particular handling from the
> ANA spec, but that would mean we're adding an ANA TT delay for each
> inaccessible path, which with the current defaults would mean booting
> is delayed by 10 seconds per inaccessible path.

No.  We should only retry on an inaccessible or change path if we
don't have an optimized or non-optimized path available.

---
From bf729292705351639aa1eac1bde2e1afc25f11b7 Mon Sep 17 00:00:00 2001
From: Johannes Thumshirn <jthumshirn@suse.de>
Date: Wed, 30 May 2018 14:46:39 +0200
Subject: nvme: don't hold nvmf_transports_rwsem for more than transport
 lookups

Only take nvmf_transports_rwsem when doing a lookup of registered
transports, so that a blocking ->create_ctrl doesn't prevent other
actions on /dev/nvme-fabrics.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: increased lock hold time a bit to be safe, added a comment
 and updated the changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/fabrics.c | 3 ++-
 drivers/nvme/host/fabrics.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index aa318136460e..5f9618a2fd3d 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -969,6 +969,7 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 		ret = -EBUSY;
 		goto out_unlock;
 	}
+	up_read(&nvmf_transports_rwsem);
 
 	ret = nvmf_check_required_opts(opts, ops->required_opts);
 	if (ret)
@@ -985,11 +986,11 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 	}
 
 	module_put(ops->module);
-	up_read(&nvmf_transports_rwsem);
 	return ctrl;
 
 out_module_put:
 	module_put(ops->module);
+	goto out_free_opts;
 out_unlock:
 	up_read(&nvmf_transports_rwsem);
 out_free_opts:
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index ef46c915b7b5..d778f14bff20 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -124,6 +124,9 @@ struct nvmf_ctrl_options {
  *	1. At minimum, 'required_opts' and 'allowed_opts' should
  *	   be set to the same enum parsing options defined earlier.
  *	2. create_ctrl() must be defined (even if it does nothing)
+ *	3. struct nvmf_transport_ops must be statically allocated in the
+ *	   module's .bss section so that a pure module_get on @module
+ *	   prevents the memory from being freed.
  */
 struct nvmf_transport_ops {
 	struct list_head	entry;
-- 
2.17.0


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 12:54     ` Christoph Hellwig
@ 2018-05-30 23:10       ` Sagi Grimberg
  2018-06-04  6:24         ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2018-05-30 23:10 UTC (permalink / raw)



> And I guess this is where the real issue starts.  We should not do block
> I/O under nvmf_dev_mutex.
> 
> James has some work on asynchronous connects pending for FC, so I guess
> we could look into generalizing that and always execute the real connect
> work in a different thread.

I still need to ramp up on the ANA patches, but I'd prefer moving
create_ctrl outside of nvmf_dev_mutex instead of having it async.


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 23:10       ` Sagi Grimberg
@ 2018-06-04  6:24         ` Hannes Reinecke
  2018-06-04  6:37           ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2018-06-04  6:24 UTC (permalink / raw)


On Thu, 31 May 2018 02:10:35 +0300
Sagi Grimberg <sagi@grimberg.me> wrote:

> > And I guess this is where the real issue starts.  We should not do
> > block I/O under nvmf_dev_mutex.
> > 
> > James has some work on asynchronous connects pending for FC, so I
> > guess we could look into generalizing that and always execute the
> > real connect work in a different thread.
> 
> I still need to ramp up on the ANA patches, but I'd prefer moving
> create_ctrl outside of nvmf_dev_mutex instead of having it async.

We'd need the async patches still, as they provide a way to call
nvme-cli directly from within udev rules.
Without those patches nvme-cli might get stuck (as this issue neatly
proves) and would block udev from processing events.

Cheers,

Hannes


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-06-04  6:24         ` Hannes Reinecke
@ 2018-06-04  6:37           ` Christoph Hellwig
  2018-06-04 12:17             ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2018-06-04  6:37 UTC (permalink / raw)


On Mon, Jun 04, 2018 at 08:24:55AM +0200, Hannes Reinecke wrote:
> On Thu, 31 May 2018 02:10:35 +0300
> Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> > > And I guess this is where the real issue starts.  We should not do
> > > block I/O under nvmf_dev_mutex.
> > > 
> > > James has some work on asynchronous connects pending for FC, so I
> > > guess we could look into generalizing that and always execute the
> > > real connect work in a different thread.
> > 
> > I still need to ramp up on the ANA patches, but I'd prefer moving
> > create_ctrl outside of nvmf_dev_mutex instead of having it async.
> 
> > We'd need the async patches still, as they provide a way to call
> > nvme-cli directly from within udev rules.
> Without those patches nvme-cli might get stuck (as this issue neatly
> proves) and would block udev from processing events.

So just run nvme-cli in the background from udev.  Or if udev really
is too stupid for that, have an option in nvme-cli itself to fork and
run in the background.


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-06-04  6:37           ` Christoph Hellwig
@ 2018-06-04 12:17             ` Sagi Grimberg
  2018-06-04 12:56               ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2018-06-04 12:17 UTC (permalink / raw)




>> We'd need the async patches still, as they provide a way to call
>> nvme-cli directly from within udev rules.
>> Without those patches nvme-cli might get stuck (as this issue neatly
>> proves) and would block udev from processing events.
> 
> So just run nvme-cli in the background from udev.  Or if udev really
> is too stupid for that, have an option in nvme-cli itself to fork and
> run in the background.

From the udev man page:

RUN{type}

...
        This can only be used for very short-running foreground tasks.
        Running an event process for a long period of time may block all
        further events for this or a dependent device.

        Starting daemons or other long running processes is not appropriate
        for udev; the forked processes, detached or not, will be
        unconditionally killed after the event handling has finished.

One way around this is to disown the command from the udev shell
(e.g. $cmd & disown).


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-06-04 12:17             ` Sagi Grimberg
@ 2018-06-04 12:56               ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2018-06-04 12:56 UTC (permalink / raw)


On Mon, Jun 04, 2018 at 03:17:43PM +0300, Sagi Grimberg wrote:
> From the udev man page:
>
> RUN{type}
>
> ...
>        This can only be used for very short-running foreground tasks.
>        Running an event process for a long period of time may block all
>        further events for this or a dependent device.
>
>        Starting daemons or other long running processes is not appropriate
>        for udev; the forked processes, detached or not, will be
>        unconditionally killed after the event handling has finished.
>
> One way around this is to disown the command from udev shell
> (e.g. $cmd & disown)

Oh well.  Offloading to a workqueue in the kernel to work around
this just doesn't really sound like the proper solution.  Then again,
our connect really should not take longer than a reasonable timeout
anyway.


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-05-30 12:12 ` Christoph Hellwig
  2018-05-30 12:30   ` Hannes Reinecke
@ 2018-06-06 19:02   ` Popuri, Sriram
  2018-06-07  5:54     ` Hannes Reinecke
  1 sibling, 1 reply; 11+ messages in thread
From: Popuri, Sriram @ 2018-06-06 19:02 UTC (permalink / raw)


Looks like __nvme_find_path is a pretty basic one.
Say there are 2 NVME_ANA_OPTIMIZED paths: it looks like the first path found is always chosen, which means having more than one NVME_ANA_OPTIMIZED path doesn't help much. Any plans to round-robin?

> -	/* XXX: try an inaccessible path as last resort per 8.18.3.3 */
> +	if (!ns)
> +		ns = __nvme_find_path(head, NVME_ANA_INACCESSIBLE);

Also if there are multiple NVME_ANA_INACCESSIBLE paths and only a few (or one) are changing ANA state from INACCESSIBLE to OPTIMIZED, and that happens to be the last entry in your list, then trying the above doesn't really help.

Regards,
~Sriram


* [PATCH] nvme: avoid hang on inaccessible paths
  2018-06-06 19:02   ` Popuri, Sriram
@ 2018-06-07  5:54     ` Hannes Reinecke
  0 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2018-06-07  5:54 UTC (permalink / raw)


On Wed, 6 Jun 2018 19:02:16 +0000
"Popuri, Sriram" <Sriram.Popuri@netapp.com> wrote:

> Looks like __nvme_find_path is a pretty basic one.
> Say there are 2 NVME_ANA_OPTIMIZED paths: it looks like the first
> path found is always chosen, which means having more than one
> NVME_ANA_OPTIMIZED path doesn't help much. Any plans to round-robin?
> 
> > -	/* XXX: try an inaccessible path as last resort per 8.18.3.3 */
> > +	if (!ns)
> > +		ns = __nvme_find_path(head, NVME_ANA_INACCESSIBLE);
> 
> Also if there are multiple NVME_ANA_INACCESSIBLE paths and only a
> few (or one) are changing ANA state from INACCESSIBLE to OPTIMIZED,
> and that happens to be the last entry in your list, then trying the
> above doesn't really help.
> 
That patch has been retracted; I'll be sending an updated patch series
for implementing ANATT handling soon.

Cheers,

Hannes

