* SCSI layer RPM deadlock debug suggestion
@ 2021-07-02 17:03 John Garry
  2021-07-02 20:31 ` Alan Stern
  0 siblings, 1 reply; 13+ messages in thread
From: John Garry @ 2021-07-02 17:03 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig, Alan Stern, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke
  Cc: chenxiang, Xiejianqin

Hi guys,

We're experiencing a deadlock between trying to remove a SATA device and 
doing a rescan in scsi_rescan_device().

I'm just looking for a suggestion on how to solve it.

The background is that the host (hisi sas v3 hw) uses the SAS SCSI 
transport and supports RPM (runtime PM). In the testcase, the host and 
disks are runtime-suspended. Then we run fio on the disks to make them 
active and then immediately hard-reset the disk link, which causes the 
disk to be disconnected (please don't ask why ...).

We find that there is a conflict between the rescan and the device 
removal code, resulting in a deadlock:

[ 607.429281] Call trace:
[ 607.433083] __switch_to+0x164/0x1d4
[ 607.437596] __schedule+0x8f8/0x1450
[ 607.441183] schedule+0x7c/0x110
[ 607.444422] blk_queue_enter+0x290/0x490
[ 607.448358] blk_mq_alloc_request+0x50/0xb4
[ 607.452547] blk_get_request+0x38/0x80
[ 607.456305] __scsi_execute+0x6c/0x1c4
[ 607.460064] scsi_vpd_inquiry+0x88/0xf0
[ 607.463908] scsi_get_vpd_buf+0x68/0xb0
[ 607.467752] scsi_attach_vpd+0x58/0x170
[ 607.471596] scsi_rescan_device+0x40/0xac
[ 607.475612] ata_scsi_dev_rescan+0xb4/0x14c
[ 607.479802] process_one_work+0x29c/0x6fc
[ 607.483819] worker_thread+0x80/0x470
[ 607.487489] kthread+0x15c/0x170
[ 607.490727] ret_from_fork+0x10/0x18

sas_phy_event_worker [libsas]
[ 607.529831] Call trace:
[ 607.532312] __switch_to+0x164/0x1d4
[ 607.535900] __schedule+0x8f8/0x1450
[ 607.539484] schedule+0x7c/0x110
[ 607.542724] schedule_preempt_disabled+0x30/0x4c
[ 607.547345] __mutex_lock+0x308/0x8b0
[ 607.551016] mutex_lock_nested+0x44/0x70
[ 607.554947] device_del+0x4c/0x450
[ 607.558341] __scsi_remove_device+0x11c/0x14c
[ 607.562702] scsi_remove_target+0x1bc/0x240
[ 607.566891] sas_rphy_remove+0x90/0x94
[ 607.570649] sas_rphy_delete+0x24/0x40
[ 607.574388] sas_destruct_devices+0x64/0xa0 [libsas]
[ 607.579359] sas_deform_port+0x178/0x1bc [libsas]
[ 607.584069] sas_phye_loss_of_signal+0x28/0x34 [libsas]
[ 607.589298] sas_phy_event_worker+0x34/0x50 [libsas]
[ 607.594268] process_one_work+0x29c/0x6fc
[ 607.598284] worker_thread+0x80/0x470
[ 607.601955] kthread+0x15c/0x170
[ 607.605193] ret_from_fork+0x10/0x18
[ 607.608845] INFO: task fio:3382 blocked for more than 121 seconds.

The rescan holds the sdev_gendev.device lock in scsi_rescan_device(), 
while the removal code in __scsi_remove_device() wants to grab it.

However, the rescan will not release the lock until the 
blk_queue_enter() call above succeeds. That can happen in two ways:

- the queue is dying, which would not happen until after the 
device_del() in __scsi_remove_device(), so not going to happen

- q->pm_only falls to 0. This would be when scsi_runtime_resume() -> 
sdev_runtime_resume() -> blk_post_runtime_resume(err = 0) -> 
blk_set_runtime_active() is called. However, I find that the err 
argument for me is -5, which comes from sdev_runtime_resume() -> 
pm->runtime_resume (=sd_resume()), which fails. That sd_resume() -> 
sd_start_stop_device() fails as the disk is not attached. So we go into 
error state:

$ more /sys/devices/pci0000:b4/0000:b4:04.0/host3/port-3:0/end_device-3:0/target3:0:0/3:0:0:0/power/runtime_status
error
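
For reference, blk_post_runtime_resume() currently looks like this 
(block/blk-pm.c); when err is non-zero it just parks the queue back in 
RPM_SUSPENDED, so pm_only is never cleared and blk_queue_enter() keeps 
waiting:

void blk_post_runtime_resume(struct request_queue *q, int err)
{
	if (!q->dev)
		return;
	if (!err) {
		/* success: rpm_status = RPM_ACTIVE, pm_only cleared */
		blk_set_runtime_active(q);
	} else {
		/* failure: queue stays marked suspended, pm_only stays set */
		spin_lock_irq(&q->queue_lock);
		q->rpm_status = RPM_SUSPENDED;
		spin_unlock_irq(&q->queue_lock);
	}
}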

Removing commit e27829dc92e5 ("scsi: serialize ->rescan against 
->remove") solves this issue for me, but that is there for a reason.

Any suggestion on how to fix this deadlock?

Thanks,
John


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-02 17:03 SCSI layer RPM deadlock debug suggestion John Garry
@ 2021-07-02 20:31 ` Alan Stern
  2021-07-04 23:45   ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Stern @ 2021-07-02 20:31 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang, Xiejianqin

On Fri, Jul 02, 2021 at 06:03:20PM +0100, John Garry wrote:
> Hi guys,
> 
> We're experiencing a deadlock between trying to remove a SATA device and
> doing a rescan in scsi_rescan_device().
> 
> I'm just looking for a suggestion on how to solve it.
> 
> The background is that the host (hisi sas v3 hw) uses the SAS SCSI transport
> and supports RPM (runtime PM). In the testcase, the host and disks are
> runtime-suspended. Then we run fio on the disks to make them active and then
> immediately hard-reset the disk link, which causes the disk to be
> disconnected (please don't ask why ...).
> 
> We find that there is a conflict between the rescan and the device removal
> code, resulting in a deadlock:

> The rescan holds the sdev_gendev.device lock in scsi_rescan_device(), while
> the removal code in __scsi_remove_device() wants to grab it.
> 
> However, the rescan will not release the lock until the blk_queue_enter()
> call above succeeds. That can happen in two ways:
> 
> - the queue is dying, which would not happen until after the device_del() in
> __scsi_remove_device(), so not going to happen
> 
> - q->pm_only falls to 0. This would be when scsi_runtime_resume() ->
> sdev_runtime_resume() -> blk_post_runtime_resume(err = 0) ->
> blk_set_runtime_active() is called. However, I find that the err argument
> for me is -5, which comes from sdev_runtime_resume() -> pm->runtime_resume
> (=sd_resume()), which fails. That sd_resume() -> sd_start_stop_device()
> fails as the disk is not attached. So we go into error state:
> 
> $ more /sys/devices/pci0000:b4/0000:b4:04.0/host3/port-3:0/end_device-3:0/target3:0:0/3:0:0:0/power/runtime_status
> error
> 
> Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
> solves this issue for me, but that is there for a reason.
> 
> Any suggestion on how to fix this deadlock?

This is indeed a tricky question.  It seems like we should allow a 
runtime resume to succeed if the only reason it failed was that the 
device has been removed.

More generally, perhaps we should always consider that a runtime 
resume succeeds.  Any remaining problems will be dealt with by the 
device's driver and subsystem once the device is marked as 
runtime-active again.

Suppose you try changing blk_post_runtime_resume() so that it always 
calls blk_set_runtime_active() regardless of the value of err.  Does 
that fix the problem?

And more importantly, will it cause any other problems...?

Alan Stern


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-02 20:31 ` Alan Stern
@ 2021-07-04 23:45   ` Bart Van Assche
  2021-07-05 12:00     ` John Garry
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2021-07-04 23:45 UTC (permalink / raw)
  To: Alan Stern, John Garry
  Cc: Christoph Hellwig, linux-scsi, Martin K . Petersen,
	James E.J. Bottomley, Hannes Reinecke, chenxiang, Xiejianqin

On 7/2/21 1:31 PM, Alan Stern wrote:
> On Fri, Jul 02, 2021 at 06:03:20PM +0100, John Garry wrote:
>> Hi guys,
>>
>> We're experiencing a deadlock between trying to remove a SATA device and
>> doing a rescan in scsi_rescan_device().
>>
>> I'm just looking for a suggestion on how to solve it.
>>
>> The background is that the host (hisi sas v3 hw) uses the SAS SCSI transport
>> and supports RPM (runtime PM). In the testcase, the host and disks are
>> runtime-suspended. Then we run fio on the disks to make them active and then
>> immediately hard-reset the disk link, which causes the disk to be
>> disconnected (please don't ask why ...).
>>
>> We find that there is a conflict between the rescan and the device removal
>> code, resulting in a deadlock:
> 
>> The rescan holds the sdev_gendev.device lock in scsi_rescan_device(), while
>> the removal code in __scsi_remove_device() wants to grab it.
>>
>> However, the rescan will not release the lock until the blk_queue_enter()
>> call above succeeds. That can happen in two ways:
>>
>> - the queue is dying, which would not happen until after the device_del() in
>> __scsi_remove_device(), so not going to happen
>>
>> - q->pm_only falls to 0. This would be when scsi_runtime_resume() ->
>> sdev_runtime_resume() -> blk_post_runtime_resume(err = 0) ->
>> blk_set_runtime_active() is called. However, I find that the err argument
>> for me is -5, which comes from sdev_runtime_resume() -> pm->runtime_resume
>> (=sd_resume()), which fails. That sd_resume() -> sd_start_stop_device()
>> fails as the disk is not attached. So we go into error state:
>>
>> $ more /sys/devices/pci0000:b4/0000:b4:04.0/host3/port-3:0/end_device-3:0/target3:0:0/3:0:0:0/power/runtime_status
>> error
>>
>> Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
>> solves this issue for me, but that is there for a reason.
>>
>> Any suggestion on how to fix this deadlock?
> 
> This is indeed a tricky question.  It seems like we should allow a 
> runtime resume to succeed if the only reason it failed was that the 
> device has been removed.
> 
> More generally, perhaps we should always consider that a runtime 
> resume succeeds.  Any remaining problems will be dealt with by the 
> device's driver and subsystem once the device is marked as 
> runtime-active again.
> 
> Suppose you try changing blk_post_runtime_resume() so that it always 
> calls blk_set_runtime_active() regardless of the value of err.  Does 
> that fix the problem?
> 
> And more importantly, will it cause any other problems...?

That would cause trouble for the UFS driver and other drivers for which
runtime resume can fail due to e.g. the link between host and device
being in a bad state.

How about checking the SCSI device state inside scsi_rescan_device() and
skipping the rescan if the SCSI device state is SDEV_CANCEL or SDEV_DEL?
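
Something like this untested sketch (the placement of the check is 
illustrative only):

void scsi_rescan_device(struct device *dev)
{
	struct scsi_device *sdev = to_scsi_device(dev);

	device_lock(dev);
	/* untested: skip rescanning a device that is on its way out */
	if (sdev->sdev_state == SDEV_CANCEL || sdev->sdev_state == SDEV_DEL) {
		device_unlock(dev);
		return;
	}
	/* ... existing rescan work ... */
	device_unlock(dev);
}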

Adding such a check inside __scsi_execute() would break sd_remove() and
sd_shutdown() since both use __scsi_execute() to submit a SYNCHRONIZE
CACHE command to the device.

Bart.


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-04 23:45   ` Bart Van Assche
@ 2021-07-05 12:00     ` John Garry
  2021-07-05 13:17       ` Alan Stern
  0 siblings, 1 reply; 13+ messages in thread
From: John Garry @ 2021-07-05 12:00 UTC (permalink / raw)
  To: Bart Van Assche, Alan Stern
  Cc: Christoph Hellwig, linux-scsi, Martin K . Petersen,
	James E.J. Bottomley, Hannes Reinecke, chenxiang, Xiejianqin

On 05/07/2021 00:45, Bart Van Assche wrote:

Hi Alan and Bart,

Thanks for the suggestions.

>>> Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
>>> solves this issue for me, but that is there for a reason.
>>>
>>> Any suggestion on how to fix this deadlock?
>> This is indeed a tricky question.  It seems like we should allow a
>> runtime resume to succeed if the only reason it failed was that the
>> device has been removed.
>>
>> More generally, perhaps we should always consider that a runtime
>> resume succeeds.  Any remaining problems will be dealt with by the
>> device's driver and subsystem once the device is marked as
>> runtime-active again.
>>
>> Suppose you try changing blk_post_runtime_resume() so that it always
>> calls blk_set_runtime_active() regardless of the value of err.  Does
>> that fix the problem?
>>
>> And more importantly, will it cause any other problems...?
> That would cause trouble for the UFS driver and other drivers for which
> runtime resume can fail due to e.g. the link between host and device
> being in a bad state.
> 
> How about checking the SCSI device state inside scsi_rescan_device() and
> skipping the rescan if the SCSI device state is SDEV_CANCEL or SDEV_DEL?
> 

I find that the device state is still SDEV_RUNNING at that point (so 
that check cannot work).

> Adding such a check inside __scsi_execute() would break sd_remove() and
> sd_shutdown() since both use __scsi_execute() to submit a SYNCHRONIZE
> CACHE command to the device.

Could we somehow signal from __scsi_remove_device() earlier that the 
request queue is dying or at least in some error state, so that 
blk_queue_enter() in the rescan can fail?

Currently we don't call blk_cleanup_queue() -> blk_set_queue_dying() 
until after the device_del(sdev_gendev) call in __scsi_remove_device().
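
Hand-waving illustration of what I mean (not a tested change):

void __scsi_remove_device(struct scsi_device *sdev)
{
	/* ... */
	/* hypothetical: fail blk_queue_enter() waiters before device_del() */
	blk_set_queue_dying(sdev->request_queue);
	device_del(&sdev->sdev_gendev);
	/* ... blk_cleanup_queue() follows later, as today ... */
}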

Thanks,
John


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-05 12:00     ` John Garry
@ 2021-07-05 13:17       ` Alan Stern
  2021-07-05 13:20         ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Stern @ 2021-07-05 13:17 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang, Xiejianqin

On Mon, Jul 05, 2021 at 01:00:39PM +0100, John Garry wrote:
> On 05/07/2021 00:45, Bart Van Assche wrote:
> 
> Hi Alan and Bart,
> 
> Thanks for the suggestions.
> 
> > > > Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
> > > > solves this issue for me, but that is there for a reason.
> > > > 
> > > > Any suggestion on how to fix this deadlock?
> > > This is indeed a tricky question.  It seems like we should allow a
> > > runtime resume to succeed if the only reason it failed was that the
> > > device has been removed.
> > > 
> > > More generally, perhaps we should always consider that a runtime
> > > resume succeeds.  Any remaining problems will be dealt with by the
> > > device's driver and subsystem once the device is marked as
> > > runtime-active again.
> > > 
> > > Suppose you try changing blk_post_runtime_resume() so that it always
> > > calls blk_set_runtime_active() regardless of the value of err.  Does
> > > that fix the problem?
> > > 
> > > And more importantly, will it cause any other problems...?
> > That would cause trouble for the UFS driver and other drivers for which
> > runtime resume can fail due to e.g. the link between host and device
> > being in a bad state.

I don't understand how that could work.  If a device fails to resume 
from runtime suspend, no matter whether the reason is temporary or 
permanent, how can the system use it again?

And if the system can't use it again, what harm is there in pretending 
that the runtime resume succeeded?

> > How about checking the SCSI device state inside scsi_rescan_device() and
> > skipping the rescan if the SCSI device state is SDEV_CANCEL or SDEV_DEL?
> > 
> 
> I find that the device state is still SDEV_RUNNING at that point (so that
> check cannot work).
> 
> > Adding such a check inside __scsi_execute() would break sd_remove() and
> > sd_shutdown() since both use __scsi_execute() to submit a SYNCHRONIZE
> > CACHE command to the device.
> 
> Could we somehow signal from __scsi_remove_device() earlier that the request
> queue is dying or at least in some error state, so that blk_queue_enter() in
> the rescan can fail?
> 
> Currently we don't call blk_cleanup_queue() -> blk_set_queue_dying() until
> after the device_del(sdev_gendev) call in __scsi_remove_device().

I don't think that can be done.  device_del() calls the driver's 
remove routine, which may want to communicate with the device.  If the 
request queue is already in an error state, it won't be able to do so.

Alan Stern


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-05 13:17       ` Alan Stern
@ 2021-07-05 13:20         ` Hannes Reinecke
  2021-07-07 15:08           ` John Garry
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2021-07-05 13:20 UTC (permalink / raw)
  To: Alan Stern, John Garry
  Cc: Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang, Xiejianqin

On 7/5/21 3:17 PM, Alan Stern wrote:
> On Mon, Jul 05, 2021 at 01:00:39PM +0100, John Garry wrote:
>> On 05/07/2021 00:45, Bart Van Assche wrote:
>>
>> Hi Alan and Bart,
>>
>> Thanks for the suggestions.
>>
>>>>> Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
>>>>> solves this issue for me, but that is there for a reason.
>>>>>
>>>>> Any suggestion on how to fix this deadlock?
>>>> This is indeed a tricky question.  It seems like we should allow a
>>>> runtime resume to succeed if the only reason it failed was that the
>>>> device has been removed.
>>>>
>>>> More generally, perhaps we should always consider that a runtime
>>>> resume succeeds.  Any remaining problems will be dealt with by the
>>>> device's driver and subsystem once the device is marked as
>>>> runtime-active again.
>>>>
>>>> Suppose you try changing blk_post_runtime_resume() so that it always
>>>> calls blk_set_runtime_active() regardless of the value of err.  Does
>>>> that fix the problem?
>>>>
>>>> And more importantly, will it cause any other problems...?
>>> That would cause trouble for the UFS driver and other drivers for which
>>> runtime resume can fail due to e.g. the link between host and device
>>> being in a bad state.
> 
> I don't understand how that could work.  If a device fails to resume
> from runtime suspend, no matter whether the reason is temporary or
> permanent, how can the system use it again?
> 
> And if the system can't use it again, what harm is there in pretending
> that the runtime resume succeeded?
> 
'xactly.
Especially as we _do_ have error recovery on SCSI, we should be 
treating a failure to resume just like any other SCSI error; in the end, 
we need to equip SCSI EH to deal with these kinds of states anyway.
And we already do: we send 'START STOP UNIT' to spin up drives which 
are found to be spun down.
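
For reference, the EH spin-up path is scsi_eh_try_stu() in 
drivers/scsi/scsi_error.c; the command it sends boils down to a six-byte 
START STOP UNIT CDB with the START bit set:

	static unsigned char stu_command[6] = {START_STOP, 0, 0, 0, 1, 0};
						/* byte 4, bit 0 = START */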

So I'm all for always returning 'success' from the 'resume' callback and 
letting SCSI EH deal with any eventual fallout.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-05 13:20         ` Hannes Reinecke
@ 2021-07-07 15:08           ` John Garry
  2021-07-14 16:10             ` Alan Stern
  0 siblings, 1 reply; 13+ messages in thread
From: John Garry @ 2021-07-07 15:08 UTC (permalink / raw)
  To: Hannes Reinecke, Alan Stern
  Cc: Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang, Xiejianqin

>>>>>> Any suggestion on how to fix this deadlock?
>>>>> This is indeed a tricky question.  It seems like we should allow a
>>>>> runtime resume to succeed if the only reason it failed was that the
>>>>> device has been removed.
>>>>>
>>>>> More generally, perhaps we should always consider that a runtime
>>>>> resume succeeds.  Any remaining problems will be dealt with by the
>>>>> device's driver and subsystem once the device is marked as
>>>>> runtime-active again.
>>>>>
>>>>> Suppose you try changing blk_post_runtime_resume() so that it always
>>>>> calls blk_set_runtime_active() regardless of the value of err.  Does
>>>>> that fix the problem?
>>>>>

Hi Alan,

I tried that suggestion with the following change:


--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -185,9 +185,8 @@ EXPORT_SYMBOL(blk_pre_runtime_resume);
   */
void blk_post_runtime_resume(struct request_queue *q, int err)
{
-
+       err = 0;
         if (!q->dev)
                 return;
         if (!err) {


And that looks to solve the deadlock which I was seeing. I'm not sure on 
side-effects elsewhere.

We'll test it a bit more.

Thanks,
John

>>>>> And more importantly, will it cause any other problems...?
>>>> That would cause trouble for the UFS driver and other drivers for which
>>>> runtime resume can fail due to e.g. the link between host and device
>>>> being in a bad state.
>>
>> I don't understand how that could work.  If a device fails to resume
>> from runtime suspend, no matter whether the reason is temporary or
>> permanent, how can the system use it again?
>>
>> And if the system can't use it again, what harm is there in pretending
>> that the runtime resume succeeded?
>>
> 'xactly.
> Especially as we _do_ have error recovery on SCSI, we should be 
> treating a failure to resume just like any other SCSI error; in the end, 
> we need to equip SCSI EH to deal with these kinds of states anyway.
> And we already do: we send 'START STOP UNIT' to spin up drives which 
> are found to be spun down.
> 
> So I'm all for always returning 'success' from the 'resume' callback and 
> letting SCSI EH deal with any eventual fallout.




* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-07 15:08           ` John Garry
@ 2021-07-14 16:10             ` Alan Stern
  2021-07-14 16:48               ` John Garry
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Stern @ 2021-07-14 16:10 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang, Xiejianqin

On Wed, Jul 07, 2021 at 04:08:34PM +0100, John Garry wrote:
> > > > > > > Any suggestion on how to fix this deadlock?
> > > > > > This is indeed a tricky question.  It seems like we should allow a
> > > > > > runtime resume to succeed if the only reason it failed was that the
> > > > > > device has been removed.
> > > > > > 
> > > > > > More generally, perhaps we should always consider that a runtime
> > > > > > resume succeeds.  Any remaining problems will be dealt with by the
> > > > > > device's driver and subsystem once the device is marked as
> > > > > > runtime-active again.
> > > > > > 
> > > > > > Suppose you try changing blk_post_runtime_resume() so that it always
> > > > > > calls blk_set_runtime_active() regardless of the value of err.  Does
> > > > > > that fix the problem?
> > > > > > 
> 
> Hi Alan,
> 
> I tried that suggestion with the following change:
> 
> 
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -185,9 +185,8 @@ EXPORT_SYMBOL(blk_pre_runtime_resume);
>   */
> void blk_post_runtime_resume(struct request_queue *q, int err)
> {
> -
> +       err = 0;
>         if (!q->dev)
>                 return;
>         if (!err) {
> 
> 
> And that looks to solve the deadlock which I was seeing. I'm not sure on
> side-effects elsewhere.
> 
> We'll test it a bit more.

In the absence of any bad reports, here is a proposal for a patch.

Comments?

Alan Stern



John Garry reported a deadlock that occurs when trying to access a
runtime-suspended SATA device.  For obscure reasons, the rescan
procedure causes the link to be hard-reset, which disconnects the
device.

The rescan tries to carry out a runtime resume when accessing the
device.  scsi_rescan_device() holds the SCSI device lock and won't
release it until it can put commands onto the device's block queue.
This can't happen until the queue is successfully runtime-resumed or
the device is unregistered.  But the runtime resume fails because the
device is disconnected, and __scsi_remove_device() can't do the
unregistration because it can't get the device lock.

The best way to resolve this deadlock appears to be to allow the block
queue to start running again even after an unsuccessful runtime
resume.  The idea is that the driver or the SCSI error handler will
need to be able to use the queue to resolve the runtime resume
failure.

This patch removes the err argument to blk_post_runtime_resume() and
makes the routine always act as though the resume was successful.
This fixes the deadlock.

Reported-and-tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Fixes: e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
CC: Bart Van Assche <bvanassche@acm.org>
CC: Hannes Reinecke <hare@suse.de>
CC: <stable@vger.kernel.org>

---

 block/blk-pm.c         |   22 +++++++---------------
 drivers/scsi/scsi_pm.c |    2 +-
 include/linux/blk-pm.h |    2 +-
 3 files changed, 9 insertions(+), 17 deletions(-)

Index: usb-devel/block/blk-pm.c
===================================================================
--- usb-devel.orig/block/blk-pm.c
+++ usb-devel/block/blk-pm.c
@@ -163,27 +163,19 @@ EXPORT_SYMBOL(blk_pre_runtime_resume);
 /**
  * blk_post_runtime_resume - Post runtime resume processing
  * @q: the queue of the device
- * @err: return value of the device's runtime_resume function
  *
  * Description:
- *    Update the queue's runtime status according to the return value of the
- *    device's runtime_resume function. If the resume was successful, call
- *    blk_set_runtime_active() to do the real work of restarting the queue.
+ *    For historical reasons, this routine merely calls blk_set_runtime_active()
+ *    to do the real work of restarting the queue.  It does this regardless of
+ *    whether the device's runtime-resume succeeded; even if it failed the
+ *    driver or error handler will need to communicate with the device.
  *
  *    This function should be called near the end of the device's
  *    runtime_resume callback.
  */
-void blk_post_runtime_resume(struct request_queue *q, int err)
+void blk_post_runtime_resume(struct request_queue *q)
 {
-	if (!q->dev)
-		return;
-	if (!err) {
-		blk_set_runtime_active(q);
-	} else {
-		spin_lock_irq(&q->queue_lock);
-		q->rpm_status = RPM_SUSPENDED;
-		spin_unlock_irq(&q->queue_lock);
-	}
+	blk_set_runtime_active(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_resume);
 
@@ -201,7 +193,7 @@ EXPORT_SYMBOL(blk_post_runtime_resume);
  * runtime PM status and re-enable peeking requests from the queue. It
  * should be called before first request is added to the queue.
  *
- * This function is also called by blk_post_runtime_resume() for successful
+ * This function is also called by blk_post_runtime_resume() for
  * runtime resumes.  It does everything necessary to restart the queue.
  */
 void blk_set_runtime_active(struct request_queue *q)
Index: usb-devel/drivers/scsi/scsi_pm.c
===================================================================
--- usb-devel.orig/drivers/scsi/scsi_pm.c
+++ usb-devel/drivers/scsi/scsi_pm.c
@@ -262,7 +262,7 @@ static int sdev_runtime_resume(struct de
 	blk_pre_runtime_resume(sdev->request_queue);
 	if (pm && pm->runtime_resume)
 		err = pm->runtime_resume(dev);
-	blk_post_runtime_resume(sdev->request_queue, err);
+	blk_post_runtime_resume(sdev->request_queue);
 
 	return err;
 }
Index: usb-devel/include/linux/blk-pm.h
===================================================================
--- usb-devel.orig/include/linux/blk-pm.h
+++ usb-devel/include/linux/blk-pm.h
@@ -14,7 +14,7 @@ extern void blk_pm_runtime_init(struct r
 extern int blk_pre_runtime_suspend(struct request_queue *q);
 extern void blk_post_runtime_suspend(struct request_queue *q, int err);
 extern void blk_pre_runtime_resume(struct request_queue *q);
-extern void blk_post_runtime_resume(struct request_queue *q, int err);
+extern void blk_post_runtime_resume(struct request_queue *q);
 extern void blk_set_runtime_active(struct request_queue *q);
 #else
 static inline void blk_pm_runtime_init(struct request_queue *q,


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-14 16:10             ` Alan Stern
@ 2021-07-14 16:48               ` John Garry
  2021-07-14 17:10                 ` Alan Stern
  2021-09-28 14:05                 ` Alan Stern
  0 siblings, 2 replies; 13+ messages in thread
From: John Garry @ 2021-07-14 16:48 UTC (permalink / raw)
  To: Alan Stern
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang (M),
	Xiejianqin

>>
>> And that looks to solve the deadlock which I was seeing. I'm not sure on
>> side-effects elsewhere.
>>
>> We'll test it a bit more.
> 
> In the absence of any bad reports, here is a proposal for a patch.
> 
> Comments?
> 
> Alan Stern

Hi Alan,

Sorry for not getting back to you sooner. Testing so far with the 
originally proposed change [0] has not raised any issues and has solved 
the deadlock.

But we have a list of other problems to deal with in the RPM area 
related to the LLDD/libsas, so were waiting to address all of them (or 
at least have a plan) before progressing this change.

One such issue is that when we issue the link-reset which causes the 
device to be lost in the test, the disk is not found again. The customer 
may not be happy with this, so we're investigating solutions.

As for your change itself, I had something similar sitting on our dev 
branch:

[0] 
https://github.com/hisilicon/kernel-dev/commit/3696ca85c1e00257c96e40154d28b936742430c4

For me, I'm happy to hold off on any change, but if you think it's 
serious enough to progress your patch, below, now, then I think that 
should be ok.

Thanks,
John

> 
> 
> 
> John Garry reported a deadlock that occurs when trying to access a
> runtime-suspended SATA device.  For obscure reasons, the rescan
> procedure causes the link to be hard-reset, which disconnects the
> device.
> 
> The rescan tries to carry out a runtime resume when accessing the
> device.  scsi_rescan_device() holds the SCSI device lock and won't
> release it until it can put commands onto the device's block queue.
> This can't happen until the queue is successfully runtime-resumed or
> the device is unregistered.  But the runtime resume fails because the
> device is disconnected, and __scsi_remove_device() can't do the
> unregistration because it can't get the device lock.
> 
> The best way to resolve this deadlock appears to be to allow the block
> queue to start running again even after an unsuccessful runtime
> resume.  The idea is that the driver or the SCSI error handler will
> need to be able to use the queue to resolve the runtime resume
> failure.
> 
> This patch removes the err argument to blk_post_runtime_resume() and
> makes the routine always act as though the resume was successful.
> This fixes the deadlock.
> 
> Reported-and-tested-by: John Garry <john.garry@huawei.com>
> Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
> Fixes: e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
> CC: Bart Van Assche <bvanassche@acm.org>
> CC: Hannes Reinecke <hare@suse.de>
> CC: <stable@vger.kernel.org>
> 
> ---
> 
>   block/blk-pm.c         |   22 +++++++---------------
>   drivers/scsi/scsi_pm.c |    2 +-
>   include/linux/blk-pm.h |    2 +-
>   3 files changed, 9 insertions(+), 17 deletions(-)
> 
> Index: usb-devel/block/blk-pm.c
> ===================================================================
> --- usb-devel.orig/block/blk-pm.c
> +++ usb-devel/block/blk-pm.c
> @@ -163,27 +163,19 @@ EXPORT_SYMBOL(blk_pre_runtime_resume);
>   /**
>    * blk_post_runtime_resume - Post runtime resume processing
>    * @q: the queue of the device
> - * @err: return value of the device's runtime_resume function
>    *
>    * Description:
> - *    Update the queue's runtime status according to the return value of the
> - *    device's runtime_resume function. If the resume was successful, call
> - *    blk_set_runtime_active() to do the real work of restarting the queue.
> + *    For historical reasons, this routine merely calls blk_set_runtime_active()
> + *    to do the real work of restarting the queue.  It does this regardless of
> + *    whether the device's runtime-resume succeeded; even if it failed the
> + *    driver or error handler will need to communicate with the device.
>    *
>    *    This function should be called near the end of the device's
>    *    runtime_resume callback.
>    */
> -void blk_post_runtime_resume(struct request_queue *q, int err)
> +void blk_post_runtime_resume(struct request_queue *q)
>   {
> -	if (!q->dev)
> -		return;
> -	if (!err) {
> -		blk_set_runtime_active(q);
> -	} else {
> -		spin_lock_irq(&q->queue_lock);
> -		q->rpm_status = RPM_SUSPENDED;
> -		spin_unlock_irq(&q->queue_lock);
> -	}
> +	blk_set_runtime_active(q);
>   }
>   EXPORT_SYMBOL(blk_post_runtime_resume);
>   
> @@ -201,7 +193,7 @@ EXPORT_SYMBOL(blk_post_runtime_resume);
>    * runtime PM status and re-enable peeking requests from the queue. It
>    * should be called before first request is added to the queue.
>    *
> - * This function is also called by blk_post_runtime_resume() for successful
> + * This function is also called by blk_post_runtime_resume() for
>    * runtime resumes.  It does everything necessary to restart the queue.
>    */
>   void blk_set_runtime_active(struct request_queue *q)
> Index: usb-devel/drivers/scsi/scsi_pm.c
> ===================================================================
> --- usb-devel.orig/drivers/scsi/scsi_pm.c
> +++ usb-devel/drivers/scsi/scsi_pm.c
> @@ -262,7 +262,7 @@ static int sdev_runtime_resume(struct de
>   	blk_pre_runtime_resume(sdev->request_queue);
>   	if (pm && pm->runtime_resume)
>   		err = pm->runtime_resume(dev);
> -	blk_post_runtime_resume(sdev->request_queue, err);
> +	blk_post_runtime_resume(sdev->request_queue);
>   
>   	return err;
>   }
> Index: usb-devel/include/linux/blk-pm.h
> ===================================================================
> --- usb-devel.orig/include/linux/blk-pm.h
> +++ usb-devel/include/linux/blk-pm.h
> @@ -14,7 +14,7 @@ extern void blk_pm_runtime_init(struct r
>   extern int blk_pre_runtime_suspend(struct request_queue *q);
>   extern void blk_post_runtime_suspend(struct request_queue *q, int err);
>   extern void blk_pre_runtime_resume(struct request_queue *q);
> -extern void blk_post_runtime_resume(struct request_queue *q, int err);
> +extern void blk_post_runtime_resume(struct request_queue *q);
>   extern void blk_set_runtime_active(struct request_queue *q);
>   #else
>   static inline void blk_pm_runtime_init(struct request_queue *q,
> .
> 



* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-14 16:48               ` John Garry
@ 2021-07-14 17:10                 ` Alan Stern
  2021-07-14 17:41                   ` John Garry
  2021-09-28 14:05                 ` Alan Stern
  1 sibling, 1 reply; 13+ messages in thread
From: Alan Stern @ 2021-07-14 17:10 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang (M),
	Xiejianqin

On Wed, Jul 14, 2021 at 05:48:36PM +0100, John Garry wrote:
> > > 
> > > And that looks to solve the deadlock which I was seeing. I'm not sure on
> > > side-effects elsewhere.
> > > 
> > > We'll test it a bit more.
> > 
> > In the absence of any bad reports, here is a proposal for a patch.
> > 
> > Comments?
> > 
> > Alan Stern
> 
> Hi Alan,
> 
> Sorry for not getting back to you sooner. Testing so far with the originally
> proposed change [0] has not raised any issues and has solved the deadlock.
> 
> But we have a list of other problems to deal with in the RPM area related to
> the LLDD/libsas, so we're waiting to address all of them (or at least have a
> plan) before progressing this change.
> 
> One such issue is that when we issue the link-reset which causes the device
> to be lost in the test, the disk is not found again. The customer may not be
> happy with this, so we're investigating solutions.
> 
> As for your change itself, I had something similar sitting on our dev
> branch:
> 
> [0] https://github.com/hisilicon/kernel-dev/commit/3696ca85c1e00257c96e40154d28b936742430c4
> 
> For me, I'm happy to hold off on any change, but if you think it's serious
> enough to progress your patch, below, now, then I think that should be ok.

No, I don't think it's all that serious.  The scenario is probably 
pretty rare in real life, outside of a few odd circumstances like yours.  
I'm happy to wait until you're comfortable with a full set of changes.

Alan Stern


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-14 17:10                 ` Alan Stern
@ 2021-07-14 17:41                   ` John Garry
  0 siblings, 0 replies; 13+ messages in thread
From: John Garry @ 2021-07-14 17:41 UTC (permalink / raw)
  To: Alan Stern
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang (M),
	Xiejianqin

On 14/07/2021 18:10, Alan Stern wrote:
>> Hi Alan,
>>
>> Sorry for not getting back to you sooner. Testing so far with the originally
>> proposed change [0] has not raised any issues and has solved the deadlock.
>>
>> But we have a list of other problems to deal with in the RPM area related to
>> the LLDD/libsas, so were waiting to address all of them (or at least have a
>> plan) before progressing this change.
>>
>> One such issue is that when we issue the link-reset which causes the device
>> to be lost in the test, the disk is not found again. The customer may not be
>> happy with this, so we're investigating solutions.
>>
>> As for your change itself, I had something similar sitting on our dev
>> branch:
>>
>> [0]https://github.com/hisilicon/kernel-dev/commit/3696ca85c1e00257c96e40154d28b936742430c4
>>
>> For me, I'm happy to hold off on any change, but if you think it's serious
>> enough to progress your patch, below, now, then I think that should be ok.
> No, I don't think it's all that serious.  The scenario is probably
> pretty rare in real life, outside of a few odd circumstances like yours.
> I'm happy to wait until you're comfortable with a full set of changes.

Fine, and I'll ask for your change to be tested also, even though 
effectively it looks identical to what was tested already.

Thanks,
john


* Re: SCSI layer RPM deadlock debug suggestion
  2021-07-14 16:48               ` John Garry
  2021-07-14 17:10                 ` Alan Stern
@ 2021-09-28 14:05                 ` Alan Stern
  2021-09-28 14:13                   ` John Garry
  1 sibling, 1 reply; 13+ messages in thread
From: Alan Stern @ 2021-09-28 14:05 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang (M),
	Xiejianqin

On Wed, Jul 14, 2021 at 05:48:36PM +0100, John Garry wrote:
> > > 
> > > And that looks to solve the deadlock which I was seeing. I'm not sure on
> > > side-effects elsewhere.
> > > 
> > > We'll test it a bit more.
> > 
> > In the absence of any bad reports, here is a proposal for a patch.
> > 
> > Comments?
> > 
> > Alan Stern
> 
> Hi Alan,
> 
> Sorry for not getting back to you sooner. Testing so far with the originally
> proposed change [0] has not raised any issues and has solved the deadlock.
> 
> But we have a list of other problems to deal with in the RPM area related to
> the LLDD/libsas, so we're waiting to address all of them (or at least have a
> plan) before progressing this change.
> 
> One such issue is that when we issue the link-reset which causes the device
> to be lost in the test, the disk is not found again. The customer may not be
> happy with this, so we're investigating solutions.
> 
> As for your change itself, I had something similar sitting on our dev
> branch:
> 
> [0] https://github.com/hisilicon/kernel-dev/commit/3696ca85c1e00257c96e40154d28b936742430c4
> 
> For me, I'm happy to hold off on any change, but if you think it's serious
> enough to progress your patch, below, now, then I think that should be ok.
> 
> Thanks,
> John

John:

We seem to have forgotten all about this.  I just now noticed that 
this hadn't gotten into 5.15-rc3... and the reason is that it was never 
submitted!

What would you like to do?

Alan Stern

> 
> > 
> > 
> > 
> > John Garry reported a deadlock that occurs when trying to access a
> > runtime-suspended SATA device.  For obscure reasons, the rescan
> > procedure causes the link to be hard-reset, which disconnects the
> > device.
> > 
> > The rescan tries to carry out a runtime resume when accessing the
> > device.  scsi_rescan_device() holds the SCSI device lock and won't
> > release it until it can put commands onto the device's block queue.
> > This can't happen until the queue is successfully runtime-resumed or
> > the device is unregistered.  But the runtime resume fails because the
> > device is disconnected, and __scsi_remove_device() can't do the
> > unregistration because it can't get the device lock.
> > 
> > The best way to resolve this deadlock appears to be to allow the block
> > queue to start running again even after an unsuccessful runtime
> > resume.  The idea is that the driver or the SCSI error handler will
> > need to be able to use the queue to resolve the runtime resume
> > failure.
> > 
> > This patch removes the err argument to blk_post_runtime_resume() and
> > makes the routine always act as though the resume was successful.
> > This fixes the deadlock.
> > 
> > Reported-and-tested-by: John Garry <john.garry@huawei.com>
> > Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
> > Fixes: e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
> > CC: Bart Van Assche <bvanassche@acm.org>
> > CC: Hannes Reinecke <hare@suse.de>
> > CC: <stable@vger.kernel.org>
> > 
> > ---
> > 
> >   block/blk-pm.c         |   22 +++++++---------------
> >   drivers/scsi/scsi_pm.c |    2 +-
> >   include/linux/blk-pm.h |    2 +-
> >   3 files changed, 9 insertions(+), 17 deletions(-)
> > 
> > Index: usb-devel/block/blk-pm.c
> > ===================================================================
> > --- usb-devel.orig/block/blk-pm.c
> > +++ usb-devel/block/blk-pm.c
> > @@ -163,27 +163,19 @@ EXPORT_SYMBOL(blk_pre_runtime_resume);
> >   /**
> >    * blk_post_runtime_resume - Post runtime resume processing
> >    * @q: the queue of the device
> > - * @err: return value of the device's runtime_resume function
> >    *
> >    * Description:
> > - *    Update the queue's runtime status according to the return value of the
> > - *    device's runtime_resume function. If the resume was successful, call
> > - *    blk_set_runtime_active() to do the real work of restarting the queue.
> > + *    For historical reasons, this routine merely calls blk_set_runtime_active()
> > + *    to do the real work of restarting the queue.  It does this regardless of
> > + *    whether the device's runtime-resume succeeded; even if it failed the
> > + *    driver or error handler will need to communicate with the device.
> >    *
> >    *    This function should be called near the end of the device's
> >    *    runtime_resume callback.
> >    */
> > -void blk_post_runtime_resume(struct request_queue *q, int err)
> > +void blk_post_runtime_resume(struct request_queue *q)
> >   {
> > -	if (!q->dev)
> > -		return;
> > -	if (!err) {
> > -		blk_set_runtime_active(q);
> > -	} else {
> > -		spin_lock_irq(&q->queue_lock);
> > -		q->rpm_status = RPM_SUSPENDED;
> > -		spin_unlock_irq(&q->queue_lock);
> > -	}
> > +	blk_set_runtime_active(q);
> >   }
> >   EXPORT_SYMBOL(blk_post_runtime_resume);
> > @@ -201,7 +193,7 @@ EXPORT_SYMBOL(blk_post_runtime_resume);
> >    * runtime PM status and re-enable peeking requests from the queue. It
> >    * should be called before first request is added to the queue.
> >    *
> > - * This function is also called by blk_post_runtime_resume() for successful
> > + * This function is also called by blk_post_runtime_resume() for
> >    * runtime resumes.  It does everything necessary to restart the queue.
> >    */
> >   void blk_set_runtime_active(struct request_queue *q)
> > Index: usb-devel/drivers/scsi/scsi_pm.c
> > ===================================================================
> > --- usb-devel.orig/drivers/scsi/scsi_pm.c
> > +++ usb-devel/drivers/scsi/scsi_pm.c
> > @@ -262,7 +262,7 @@ static int sdev_runtime_resume(struct de
> >   	blk_pre_runtime_resume(sdev->request_queue);
> >   	if (pm && pm->runtime_resume)
> >   		err = pm->runtime_resume(dev);
> > -	blk_post_runtime_resume(sdev->request_queue, err);
> > +	blk_post_runtime_resume(sdev->request_queue);
> >   	return err;
> >   }
> > Index: usb-devel/include/linux/blk-pm.h
> > ===================================================================
> > --- usb-devel.orig/include/linux/blk-pm.h
> > +++ usb-devel/include/linux/blk-pm.h
> > @@ -14,7 +14,7 @@ extern void blk_pm_runtime_init(struct r
> >   extern int blk_pre_runtime_suspend(struct request_queue *q);
> >   extern void blk_post_runtime_suspend(struct request_queue *q, int err);
> >   extern void blk_pre_runtime_resume(struct request_queue *q);
> > -extern void blk_post_runtime_resume(struct request_queue *q, int err);
> > +extern void blk_post_runtime_resume(struct request_queue *q);
> >   extern void blk_set_runtime_active(struct request_queue *q);
> >   #else
> >   static inline void blk_pm_runtime_init(struct request_queue *q,
> > .
> > 
> 


* Re: SCSI layer RPM deadlock debug suggestion
  2021-09-28 14:05                 ` Alan Stern
@ 2021-09-28 14:13                   ` John Garry
  0 siblings, 0 replies; 13+ messages in thread
From: John Garry @ 2021-09-28 14:13 UTC (permalink / raw)
  To: Alan Stern
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E.J. Bottomley, Hannes Reinecke,
	chenxiang (M),
	Xiejianqin

On 28/09/2021 15:05, Alan Stern wrote:
>> As for your change itself, I had something similar sitting on our dev
>> branch:
>>
>> [0]https://github.com/hisilicon/kernel-dev/commit/3696ca85c1e00257c96e40154d28b936742430c4
>>
>> For me, I'm happy to hold off on any change, but if you think it's serious
>> enough to progress your patch, below, now, then I think that should be ok.
>>
>> Thanks,
>> John
> John:
> 
> We seem to have forgotten all about this.  I just now noticed that
> this hadn't gotten in 5.15-rc3... and the reason is that it was never
> submitted!
> 
> What would you like to do?
> 

Hi Alan,

We're still working through our driver RPM issues internally. It is 
taking a while.

Maybe in the next cycle we will look to upstream it.

For now, I'm happy to leave this change pending.

Thanks,
John


Thread overview: 13 messages
2021-07-02 17:03 SCSI layer RPM deadlock debug suggestion John Garry
2021-07-02 20:31 ` Alan Stern
2021-07-04 23:45   ` Bart Van Assche
2021-07-05 12:00     ` John Garry
2021-07-05 13:17       ` Alan Stern
2021-07-05 13:20         ` Hannes Reinecke
2021-07-07 15:08           ` John Garry
2021-07-14 16:10             ` Alan Stern
2021-07-14 16:48               ` John Garry
2021-07-14 17:10                 ` Alan Stern
2021-07-14 17:41                   ` John Garry
2021-09-28 14:05                 ` Alan Stern
2021-09-28 14:13                   ` John Garry
