All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sd: Fix a disk probing hang
@ 2017-11-07 17:38 Bart Van Assche
  2017-11-07 18:09 ` James Bottomley
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-11-07 17:38 UTC (permalink / raw)
  To: Martin K . Petersen, James E . J . Bottomley
  Cc: linux-scsi, Bart Van Assche, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn

Avoid that disk probing hangs as follows if a SCSI host is removed
after disk scanning started and before it completed:

Call Trace:
 __schedule+0x2fa/0xbb0
 schedule+0x36/0x90
 schedule_timeout+0x22c/0x570
 io_schedule_timeout+0x1e/0x50
 wait_for_completion_io_timeout+0x11f/0x180
 blk_execute_rq+0x86/0xc0
 scsi_execute+0xdb/0x1f0
 sd_revalidate_disk+0xed/0x1c70 [sd_mod]
 sd_probe_async+0xc3/0x1d0 [sd_mod]
 async_run_entry_fn+0x38/0x160
 process_one_work+0x20a/0x660
 worker_thread+0x3d/0x3b0
 kthread+0x13a/0x150
 ret_from_fork+0x27/0x40

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/sd.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0313486d85c8..d5e2b73c02ea 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3225,11 +3225,13 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 {
 	struct scsi_disk *sdkp = data;
 	struct scsi_device *sdp;
+	struct Scsi_Host *host;
 	struct gendisk *gd;
 	u32 index;
 	struct device *dev;
 
 	sdp = sdkp->device;
+	host = sdp->host;
 	gd = sdkp->disk;
 	index = sdkp->index;
 	dev = &sdp->sdev_gendev;
@@ -3253,6 +3255,13 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 	sdkp->first_scan = 1;
 	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
 
+	mutex_lock(&host->scan_mutex);
+	if (!scsi_host_scan_allowed(host)) {
+		sd_printk(KERN_NOTICE, sdkp, "%s: host being removed\n",
+			  __func__);
+		goto unlock;
+	}
+
 	sd_revalidate_disk(gd);
 
 	gd->flags = GENHD_FL_EXT_DEVT;
@@ -3276,8 +3285,12 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 
 	sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
 		  sdp->removable ? "removable " : "");
+unlock:
+	mutex_unlock(&host->scan_mutex);
+	scsi_host_put(host);
 	scsi_autopm_put_device(sdp);
 	put_device(&sdkp->dev);
+	return;
 }
 
 /**
@@ -3377,7 +3390,15 @@ static int sd_probe(struct device *dev)
 	get_device(dev);
 	dev_set_drvdata(dev, sdkp);
 
-	get_device(&sdkp->dev);	/* prevent release before async_schedule */
+	/* prevent release before async_schedule */
+	error = -ENODEV;
+	if (scsi_host_get(sdp->host) == NULL) {
+		sd_printk(KERN_NOTICE, sdkp, "%s: host being removed\n",
+			  __func__);
+		put_device(&sdkp->dev);
+		goto out;
+	}
+	get_device(&sdkp->dev);
 	async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain);
 
 	return 0;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sd: Fix a disk probing hang
  2017-11-07 17:38 [PATCH] sd: Fix a disk probing hang Bart Van Assche
@ 2017-11-07 18:09 ` James Bottomley
  2017-11-07 22:42   ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: James Bottomley @ 2017-11-07 18:09 UTC (permalink / raw)
  To: Bart Van Assche, Martin K . Petersen
  Cc: linux-scsi, Christoph Hellwig, Hannes Reinecke, Johannes Thumshirn

On Tue, 2017-11-07 at 09:38 -0800, Bart Van Assche wrote:
> Avoid that disk probing hangs as follows if a SCSI host is removed
> after disk scanning started and before it completed:
> 
> Call Trace:
>  __schedule+0x2fa/0xbb0
>  schedule+0x36/0x90
>  schedule_timeout+0x22c/0x570
>  io_schedule_timeout+0x1e/0x50
>  wait_for_completion_io_timeout+0x11f/0x180
>  blk_execute_rq+0x86/0xc0
>  scsi_execute+0xdb/0x1f0
>  sd_revalidate_disk+0xed/0x1c70 [sd_mod]
>  sd_probe_async+0xc3/0x1d0 [sd_mod]
>  async_run_entry_fn+0x38/0x160
>  process_one_work+0x20a/0x660
>  worker_thread+0x3d/0x3b0
>  kthread+0x13a/0x150
>  ret_from_fork+0x27/0x40
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/sd.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 0313486d85c8..d5e2b73c02ea 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3225,11 +3225,13 @@ static void sd_probe_async(void *data,
> async_cookie_t cookie)
>  {
>  	struct scsi_disk *sdkp = data;
>  	struct scsi_device *sdp;
> +	struct Scsi_Host *host;
>  	struct gendisk *gd;
>  	u32 index;
>  	struct device *dev;
>  
>  	sdp = sdkp->device;
> +	host = sdp->host;
>  	gd = sdkp->disk;
>  	index = sdkp->index;
>  	dev = &sdp->sdev_gendev;
> @@ -3253,6 +3255,13 @@ static void sd_probe_async(void *data,
> async_cookie_t cookie)
>  	sdkp->first_scan = 1;
>  	sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS;
>  
> +	mutex_lock(&host->scan_mutex);

I really don't like this: by taking the scan mutex here, you
synchronize this with everything else and make this routine single
threaded with every other host scan operation.  That would make the
name sd_probe_async() a complete lie.

Additionally, any reference to the disk should *automatically* hold the
host, because the last reference to the host is in the disk release
routine, so this explicit taking of a reference should be completely
unnecessary (and if it isn't, we need to fix the bug at source, not
hide it like this).

The whole point about our async routines is that they're supposed to
rely on refcounting.  So, the host cannot be freed until the last
device reference is gone.  However, the host and its devices can go
into DEL state, which means the mid-layer replies error for them and
the async scan is supposed to take that error and pass it up.  The hang
you're getting may be the result of a missing scsi_device_online()
check, or it could be some premature failure of the underlying device
driver (going into SHOST_DEL with outstanding commands causes them to
get frozen) but can you investigate the root cause rather than trying
this bandaid?

Thanks,

James

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sd: Fix a disk probing hang
  2017-11-07 18:09 ` James Bottomley
@ 2017-11-07 22:42   ` Bart Van Assche
  2017-11-07 22:57     ` James Bottomley
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2017-11-07 22:42 UTC (permalink / raw)
  To: jejb, martin.petersen; +Cc: linux-scsi, hch, hare, jthumshirn

On Tue, 2017-11-07 at 10:09 -0800, James Bottomley wrote:
> but can you investigate the root cause rather than trying this bandaid?

Hello James,

Thanks for your reply. I think that the root cause is that SCSI scanning
activity can continue to submit I/O even after scsi_remove_host() has
unlocked scan_mutex but that scsi_remove_host() removes some of the
infrastructure that is essential to process SCSI requests. Are you OK with
e.g. moving a significant part of scsi_remove_host() into
scsi_host_dev_release()?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sd: Fix a disk probing hang
  2017-11-07 22:42   ` Bart Van Assche
@ 2017-11-07 22:57     ` James Bottomley
  2017-11-08  8:12       ` Hannes Reinecke
  0 siblings, 1 reply; 6+ messages in thread
From: James Bottomley @ 2017-11-07 22:57 UTC (permalink / raw)
  To: Bart Van Assche, martin.petersen; +Cc: linux-scsi, hch, hare, jthumshirn

On Tue, 2017-11-07 at 22:42 +0000, Bart Van Assche wrote:
> On Tue, 2017-11-07 at 10:09 -0800, James Bottomley wrote:
> > 
> > but can you investigate the root cause rather than trying this
> > bandaid?
> 
> Hello James,
> 
> Thanks for your reply. I think that the root cause is that SCSI
> scanning activity can continue to submit I/O even after
> scsi_remove_host() has unlocked scan_mutex but that
> scsi_remove_host() removes some of the infrastructure that is
> essential to process SCSI requests.

That's not really a useful answer: how does it submit I/O after the
device goes into DEL?  In theory every I/O submitted after this is
returned with an immediate error.  I could buy the fact that we have
pending I/O submitted before we go into DEL, which would argue for some
sort of quiesce wait, but I don't see how I/O submitted after DEL
causes a hang.

>  Are you OK with
> e.g. moving a significant part of scsi_remove_host() into
> scsi_host_dev_release()?

Well not really without seeing the root cause.  Before scsi_forget_host
()it's all about state and after it's just removing some user visible
host attributes, so I can't see how either matters much.
 scsi_forget_host() must be executed from scsi_remove_host() because
that's how the devices go into the DEL state and how we error the
requests without troubling the device driver, so that can't be moved to
release

James

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sd: Fix a disk probing hang
  2017-11-07 22:57     ` James Bottomley
@ 2017-11-08  8:12       ` Hannes Reinecke
  2017-11-08 16:31         ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Hannes Reinecke @ 2017-11-08  8:12 UTC (permalink / raw)
  To: James Bottomley, Bart Van Assche, martin.petersen
  Cc: linux-scsi, hch, jthumshirn

On 11/07/2017 11:57 PM, James Bottomley wrote:
> On Tue, 2017-11-07 at 22:42 +0000, Bart Van Assche wrote:
>> On Tue, 2017-11-07 at 10:09 -0800, James Bottomley wrote:
>>>
>>> but can you investigate the root cause rather than trying this
>>> bandaid?
>>
>> Hello James,
>>
>> Thanks for your reply. I think that the root cause is that SCSI
>> scanning activity can continue to submit I/O even after
>> scsi_remove_host() has unlocked scan_mutex but that
>> scsi_remove_host() removes some of the infrastructure that is
>> essential to process SCSI requests.
> 
> That's not really a useful answer: how does it submit I/O after the
> device goes into DEL?  In theory every I/O submitted after this is
> returned with an immediate error.  I could buy the fact that we have
> pending I/O submitted before we go into DEL, which would argue for some
> sort of quiesce wait, but I don't see how I/O submitted after DEL
> causes a hang.
> 
>>  Are you OK with
>> e.g. moving a significant part of scsi_remove_host() into
>> scsi_host_dev_release()?
> 
> Well not really without seeing the root cause.  Before scsi_forget_host
> ()it's all about state and after it's just removing some user visible
> host attributes, so I can't see how either matters much.
>  scsi_forget_host() must be executed from scsi_remove_host() because
> that's how the devices go into the DEL state and how we error the
> requests without troubling the device driver, so that can't be moved to
> release
> 
You know, this actually looks like the same issue I'm chasing with iser;
we have a customer who regularly sees lockups during scanning.
As it turns out, iser is calling scsi_device_del() from the RX thread.
Which in turn needs to call async_synchronize().
If a disk scan is running at the same time we have a nice deadlock, as
the RX thread can't move forward before aynch_synchronize() returns,
which it'll never do as the scan cannot complete.
I've tried to fix that by having the async probing only waiting for that
particular instance (look for patch 'sd: use async_probe cookie to avoid
deadlocks'), but this wasn't greeted with much enthusiasm.

So maybe it's time to investigate this properly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.com			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] sd: Fix a disk probing hang
  2017-11-08  8:12       ` Hannes Reinecke
@ 2017-11-08 16:31         ` Bart Van Assche
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2017-11-08 16:31 UTC (permalink / raw)
  To: jejb, hare, martin.petersen; +Cc: linux-scsi, hch, jthumshirn

On Wed, 2017-11-08 at 09:12 +0100, Hannes Reinecke wrote:
> You know, this actually looks like the same issue I'm chasing with iser;
> we have a customer who regularly sees lockups during scanning.
> As it turns out, iser is calling scsi_device_del() from the RX thread.
> Which in turn needs to call async_synchronize().
> If a disk scan is running at the same time we have a nice deadlock, as
> the RX thread can't move forward before aynch_synchronize() returns,
> which it'll never do as the scan cannot complete.
> I've tried to fix that by having the async probing only waiting for that
> particular instance (look for patch 'sd: use async_probe cookie to avoid
> deadlocks'), but this wasn't greeted with much enthusiasm.

Hello Hannes,

Since I applied Roman Penyaev's patch "[PATCH 1/1] [RFC] blk-mq: fix queue
stalling on shared hctx restart" I have not been able to reproduce this hang.
I will let you know if I would run into this hang again.

Bart.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-11-08 16:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-07 17:38 [PATCH] sd: Fix a disk probing hang Bart Van Assche
2017-11-07 18:09 ` James Bottomley
2017-11-07 22:42   ` Bart Van Assche
2017-11-07 22:57     ` James Bottomley
2017-11-08  8:12       ` Hannes Reinecke
2017-11-08 16:31         ` Bart Van Assche

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.