All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Christie <michael.christie@oracle.com>
To: Wenchao Hao <haowenchao2@huawei.com>,
	"James E . J . Bottomley" <jejb@linux.ibm.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, louhongxiang@huawei.com
Subject: Re: [PATCH v3 4/4] scsi: scsi_core: Fix IO hang when device removing
Date: Tue, 14 Nov 2023 15:47:43 -0600	[thread overview]
Message-ID: <b00a9920-f16f-4187-9a7c-6083c5d98fb8@oracle.com> (raw)
In-Reply-To: <b95cf250-9a58-4a9f-8b55-4b6d00a6a6c2@oracle.com>

On 11/14/23 3:23 PM, Mike Christie wrote:
> On 10/15/23 9:03 PM, Wenchao Hao wrote:
>> shost_for_each_device() would skip devices which is in progress of
>> removing, so scsi_run_queue() for these devices would be skipped in
>> scsi_run_host_queues() after blocking hosts' IO.
>>
>> IO hang would be caused if return true when state is SDEV_CANCEL with
>> following order:
>>
>> T1:					    T2:scsi_error_handler
>> __scsi_remove_device()
>>   scsi_device_set_state(sdev, SDEV_CANCEL)
>>   ...
>>   sd_remove()
>>   del_gendisk()
>>   blk_mq_freeze_queue_wait()
>>   					    scsi_eh_flush_done_q()
>> 					      scsi_queue_insert(scmd,...)
>>
>> scsi_queue_insert() would not kick device's queue since commit
>> 8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary")
>>
>> After scsi_unjam_host(), the scsi error handler would call
>> scsi_run_host_queues() to trigger run queue for devices, while it
>> would not run queue for devices which is in progress of removing
>> because shost_for_each_device() would skip them.
>>
>> So the requests added to these queues would not be handled any more,
>> and the removing device process would hang too.
>>
>> Fix this issue by using shost_for_each_device_include_deleted() in
>> scsi_run_host_queues() to trigger a run queue for devices in removing.
>>
>> Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
>> ---
>>  drivers/scsi/scsi_lib.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index 195ca80667d0..40f407ffd26f 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -466,7 +466,7 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
>>  {
>>  	struct scsi_device *sdev;
>>  
>> -	shost_for_each_device(sdev, shost)
>> +	shost_for_each_device_include_deleted(sdev, shost)
>>  		scsi_run_queue(sdev->request_queue);
> 
> What happens if there were no commands for the device that
> was destroyed and we race with this code and device deletion?
> 
> So thread1 has set the device state tp SDEV_DEL and has finished
> blk_mq_destroy_queue because there were no commands running.
> 
> The above eh thread, then is calling:
> 
> scsi_run_queue -> blk_mq_kick_requeue_list
> 
> and that queues the requeue work.
> 
> blk_mq_destroy_queue had done blk_mq_cancel_work_sync but
> blk_mq_kick_requeue_list just added it back on the kblockd_workqueue.
> 
> When __scsi_iterate_devices does scsi_device_put it would call
> scsi_device_dev_release and call blk_put_queue which frees the
> request_queue while it's requeue work might still be queued on
> kblockd_workqueue.
> 

Oh yeah, for your other lun/target reset patches were you trying to
do something where you have a list for each scsi_device or a list of
scsi_devices that needed error handler work? If so, maybe break that
part out and use it here first.

You can then just loop over the list of devices that needed work and
start those above.

  reply	other threads:[~2023-11-14 21:48 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-16  2:03 [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle Wenchao Hao
2023-10-16  2:03 ` [PATCH v3 1/4] scsi: core: Add new helper to iterate all devices of host Wenchao Hao
2023-10-17 15:31   ` kernel test robot
2023-10-16  2:03 ` [PATCH v3 2/4] scsi: scsi_error: Fix wrong statistic when print error info Wenchao Hao
2023-10-16  2:03 ` [PATCH v3 3/4] scsi: scsi_error: Fix device reset is not triggered Wenchao Hao
2023-10-16  2:03 ` [PATCH v3 4/4] scsi: scsi_core: Fix IO hang when device removing Wenchao Hao
2023-11-14 21:23   ` Mike Christie
2023-11-14 21:47     ` Mike Christie [this message]
2023-11-15 16:24       ` Wenchao Hao
2024-03-07 14:36   ` Wenchao Hao
2023-10-17 17:00 ` [PATCH v3 0/4] SCSI: Fix issues between removing device and error handle Wenchao Hao
2023-10-17 21:41   ` Bart Van Assche
2023-10-18  1:37     ` Wenchao Hao
2023-10-18 13:51       ` Bart Van Assche
2023-10-18 14:40         ` Wenchao Hao
2023-10-18 18:03           ` Bart Van Assche
2023-11-13 16:00 ` Wenchao Hao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b00a9920-f16f-4187-9a7c-6083c5d98fb8@oracle.com \
    --to=michael.christie@oracle.com \
    --cc=haowenchao2@huawei.com \
    --cc=jejb@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=louhongxiang@huawei.com \
    --cc=martin.petersen@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.