[PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
@ 2021-01-26 13:02 Hannes Reinecke
  2021-01-26 13:21 ` Martin Wilck
  2021-01-28  0:51 ` michael.christie
  0 siblings, 2 replies; 7+ messages in thread
From: Hannes Reinecke @ 2021-01-26 13:02 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Hannes Reinecke,
	Martin Wilck, Hannes Reinecke

When a command is return with DID_TRANSPORT_DISRUPTED we should
be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
the command if set.
Otherwise multipath will be requeuing a command on the failed
path and not fail it over to one of the working paths.

Cc: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_error.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index a52665eaf288..005118385b70 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
 	case DID_TIME_OUT:
 		goto check_type;
 	case DID_BUS_BUSY:
+	case DID_TRANSPORT_DISRUPTED:
 		return (scmd->request->cmd_flags & REQ_FAILFAST_TRANSPORT);
 	case DID_PARITY:
 		return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-26 13:02 [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED Hannes Reinecke
@ 2021-01-26 13:21 ` Martin Wilck
  2021-01-28  0:51 ` michael.christie
  1 sibling, 0 replies; 7+ messages in thread
From: Martin Wilck @ 2021-01-26 13:21 UTC (permalink / raw)
  To: hare, martin.petersen; +Cc: hch, linux-scsi, james.bottomley, Hannes Reinecke

On Tue, 2021-01-26 at 14:02 +0100, Hannes Reinecke wrote:
> When a command is return with DID_TRANSPORT_DISRUPTED we should
> be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
> the command if set.
> Otherwise multipath will be requeuing a command on the failed
> path and not fail it over to one of the working paths.
> 
> Cc: Martin Wilck <martin.wilck@suse.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>

Reviewed-by: Martin Wilck <mwilck@suse.com>


-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-26 13:02 [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED Hannes Reinecke
  2021-01-26 13:21 ` Martin Wilck
@ 2021-01-28  0:51 ` michael.christie
  2021-01-28  6:57   ` Hannes Reinecke
  1 sibling, 1 reply; 7+ messages in thread
From: michael.christie @ 2021-01-28  0:51 UTC (permalink / raw)
  To: Hannes Reinecke, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Martin Wilck,
	Hannes Reinecke

On 1/26/21 7:02 AM, Hannes Reinecke wrote:
> When a command is return with DID_TRANSPORT_DISRUPTED we should
> be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
> the command if set.
> Otherwise multipath will be requeuing a command on the failed
> path and not fail it over to one of the working paths.
> 
> Cc: Martin Wilck <martin.wilck@suse.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>   drivers/scsi/scsi_error.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index a52665eaf288..005118385b70 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
>   	case DID_TIME_OUT:
>   		goto check_type;
>   	case DID_BUS_BUSY:
> +	case DID_TRANSPORT_DISRUPTED:
>   		return (scmd->request->cmd_flags & REQ_FAILFAST_TRANSPORT);
>   	case DID_PARITY:
>   		return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);

We don't fast fail for that error code to avoid churn for transient 
transport problems. The FC and iscsi drivers block the rport/session, 
return that code and then it's up the fast_io_fail/replacement timeout.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-28  0:51 ` michael.christie
@ 2021-01-28  6:57   ` Hannes Reinecke
  2021-01-28 13:40     ` Ewan D. Milne
  2021-01-28 18:01     ` Mike Christie
  0 siblings, 2 replies; 7+ messages in thread
From: Hannes Reinecke @ 2021-01-28  6:57 UTC (permalink / raw)
  To: michael.christie, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Martin Wilck,
	Hannes Reinecke

On 1/28/21 1:51 AM, michael.christie@oracle.com wrote:
> On 1/26/21 7:02 AM, Hannes Reinecke wrote:
>> When a command is return with DID_TRANSPORT_DISRUPTED we should
>> be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
>> the command if set.
>> Otherwise multipath will be requeuing a command on the failed
>> path and not fail it over to one of the working paths.
>>
>> Cc: Martin Wilck <martin.wilck@suse.com>
>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>> ---
>>   drivers/scsi/scsi_error.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index a52665eaf288..005118385b70 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
>>       case DID_TIME_OUT:
>>           goto check_type;
>>       case DID_BUS_BUSY:
>> +    case DID_TRANSPORT_DISRUPTED:
>>           return (scmd->request->cmd_flags & REQ_FAILFAST_TRANSPORT);
>>       case DID_PARITY:
>>           return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);
> 
> We don't fast fail for that error code to avoid churn for transient 
> transport problems. The FC and iscsi drivers block the rport/session, 
> return that code and then it's up the fast_io_fail/replacement timeout.
> 
_But_ if prevents that command to be failed over to another path, so 
essentially we're blocking execution until fast_io_fail tmo.

For no good reason as we have other paths available.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-28  6:57   ` Hannes Reinecke
@ 2021-01-28 13:40     ` Ewan D. Milne
  2021-01-28 18:01     ` Mike Christie
  1 sibling, 0 replies; 7+ messages in thread
From: Ewan D. Milne @ 2021-01-28 13:40 UTC (permalink / raw)
  To: Hannes Reinecke, michael.christie, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Martin Wilck,
	Hannes Reinecke

On Thu, 2021-01-28 at 07:57 +0100, Hannes Reinecke wrote:
> On 1/28/21 1:51 AM, michael.christie@oracle.com wrote:
> > On 1/26/21 7:02 AM, Hannes Reinecke wrote:
> > > When a command is return with DID_TRANSPORT_DISRUPTED we should
> > > be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
> > > the command if set.
> > > Otherwise multipath will be requeuing a command on the failed
> > > path and not fail it over to one of the working paths.
> > > 
> > > Cc: Martin Wilck <martin.wilck@suse.com>
> > > Signed-off-by: Hannes Reinecke <hare@suse.com>
> > > ---
> > >   drivers/scsi/scsi_error.c | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/scsi/scsi_error.c
> > > b/drivers/scsi/scsi_error.c
> > > index a52665eaf288..005118385b70 100644
> > > --- a/drivers/scsi/scsi_error.c
> > > +++ b/drivers/scsi/scsi_error.c
> > > @@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd
> > > *scmd)
> > >       case DID_TIME_OUT:
> > >           goto check_type;
> > >       case DID_BUS_BUSY:
> > > +    case DID_TRANSPORT_DISRUPTED:
> > >           return (scmd->request->cmd_flags &
> > > REQ_FAILFAST_TRANSPORT);
> > >       case DID_PARITY:
> > >           return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);
> > 
> > We don't fast fail for that error code to avoid churn for
> > transient 
> > transport problems. The FC and iscsi drivers block the
> > rport/session, 
> > return that code and then it's up the fast_io_fail/replacement
> > timeout.
> > 
> 
> _But_ if prevents that command to be failed over to another path, so 
> essentially we're blocking execution until fast_io_fail tmo.
> 
> For no good reason as we have other paths available.

And if we don't?  Not everybody sets queue_if_no_path, right?

-Ewan
 
> 
> Cheers,
> 
> Hannes


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-28  6:57   ` Hannes Reinecke
  2021-01-28 13:40     ` Ewan D. Milne
@ 2021-01-28 18:01     ` Mike Christie
  2021-01-28 18:21       ` Mike Christie
  1 sibling, 1 reply; 7+ messages in thread
From: Mike Christie @ 2021-01-28 18:01 UTC (permalink / raw)
  To: Hannes Reinecke, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Martin Wilck,
	Hannes Reinecke

On 1/28/21 12:57 AM, Hannes Reinecke wrote:
> On 1/28/21 1:51 AM, michael.christie@oracle.com wrote:
>> On 1/26/21 7:02 AM, Hannes Reinecke wrote:
>>> When a command is return with DID_TRANSPORT_DISRUPTED we should
>>> be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
>>> the command if set.
>>> Otherwise multipath will be requeuing a command on the failed
>>> path and not fail it over to one of the working paths.
>>>
>>> Cc: Martin Wilck <martin.wilck@suse.com>
>>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>>> ---
>>>   drivers/scsi/scsi_error.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>> index a52665eaf288..005118385b70 100644
>>> --- a/drivers/scsi/scsi_error.c
>>> +++ b/drivers/scsi/scsi_error.c
>>> @@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
>>>       case DID_TIME_OUT:
>>>           goto check_type;
>>>       case DID_BUS_BUSY:
>>> +    case DID_TRANSPORT_DISRUPTED:
>>>           return (scmd->request->cmd_flags & REQ_FAILFAST_TRANSPORT);
>>>       case DID_PARITY:
>>>           return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);
>>
>> We don't fast fail for that error code to avoid churn for transient transport problems. The FC and iscsi drivers block the rport/session, return that code and then it's up the fast_io_fail/replacement timeout.
>>
> _But_ if prevents that command to be failed over to another path, so essentially we're blocking execution until fast_io_fail tmo.
> 
> For no good reason as we have other paths available.

It is for a good reason. Causing a failover to another path can lead
to other resources being shifted and putting the system out of
balance. For active/active it's most likely fine, but for active
passive targets and it might be better to wait a second or two to see
if we can recover the optimal/preferred path.

For iscsi if the user has set fast_io_fail tmo > 0 then they want the
delayed behavior. If the user wants it failed immediately they set
fast_io_tmo (iscsi replacement timeout) to zero. Note for fc you would
need to add code to do the same since zero means off.

The other issue is that your patch only solves part of the problem.
It would prevent the IO that went through the above code path and new IO
from waiting for fast_io_fail. It does not handle the other IO that would
be caught between dm-multipath and the driver when the rport/session block
completes/starts. In the end the app is probably going to get stuck waiting
for fast io fail tmo to fire.

I think for both reasons, you want to just make it so the user can set
fast io fail tmo in a way that some value means fail immediately or do
something completely new.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED
  2021-01-28 18:01     ` Mike Christie
@ 2021-01-28 18:21       ` Mike Christie
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Christie @ 2021-01-28 18:21 UTC (permalink / raw)
  To: Hannes Reinecke, Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Martin Wilck,
	Hannes Reinecke

On 1/28/21 12:01 PM, Mike Christie wrote:
> On 1/28/21 12:57 AM, Hannes Reinecke wrote:
>> On 1/28/21 1:51 AM, michael.christie@oracle.com wrote:
>>> On 1/26/21 7:02 AM, Hannes Reinecke wrote:
>>>> When a command is return with DID_TRANSPORT_DISRUPTED we should
>>>> be looking at the REQ_FAILFAST_TRANSPORT flag and do not retry
>>>> the command if set.
>>>> Otherwise multipath will be requeuing a command on the failed
>>>> path and not fail it over to one of the working paths.
>>>>
>>>> Cc: Martin Wilck <martin.wilck@suse.com>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>>>> ---
>>>>   drivers/scsi/scsi_error.c | 1 +
>>>>   1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>>> index a52665eaf288..005118385b70 100644
>>>> --- a/drivers/scsi/scsi_error.c
>>>> +++ b/drivers/scsi/scsi_error.c
>>>> @@ -1753,6 +1753,7 @@ int scsi_noretry_cmd(struct scsi_cmnd *scmd)
>>>>       case DID_TIME_OUT:
>>>>           goto check_type;
>>>>       case DID_BUS_BUSY:
>>>> +    case DID_TRANSPORT_DISRUPTED:
>>>>           return (scmd->request->cmd_flags & REQ_FAILFAST_TRANSPORT);
>>>>       case DID_PARITY:
>>>>           return (scmd->request->cmd_flags & REQ_FAILFAST_DEV);
>>>
>>> We don't fast fail for that error code to avoid churn for transient transport problems. The FC and iscsi drivers block the rport/session, return that code and then it's up the fast_io_fail/replacement timeout.
>>>
>> _But_ if prevents that command to be failed over to another path, so essentially we're blocking execution until fast_io_fail tmo.
>>
>> For no good reason as we have other paths available.
> 
> It is for a good reason. Causing a failover to another path can lead
> to other resources being shifted and putting the system out of
> balance. For active/active it's most likely fine, but for active
> passive targets and it might be better to wait a second or two to see
> if we can recover the optimal/preferred path.
> 
> For iscsi if the user has set fast_io_fail tmo > 0 then they want the
> delayed behavior. If the user wants it failed immediately they set
> fast_io_tmo (iscsi replacement timeout) to zero. Note for fc you would
> need to add code to do the same since zero means off.
>  
> The other issue is that your patch only solves part of the problem.
> It would prevent the IO that went through the above code path and new IO
> from waiting for fast_io_fail. It does not handle the other IO that would
> be caught between dm-multipath and the driver when the rport/session block
> completes/starts. In the end the app is probably going to get stuck waiting
> for fast io fail tmo to fire.
> 
> I think for both reasons, you want to just make it so the user can set
> fast io fail tmo in a way that some value means fail immediately or do
> something completely new.

For that last part, I think in the short term we could just fix up 
csi_transport_fc to work like iscsi, but for the longer term to handle
cases like you have N>1 paths in a multipath group and are doing
active/active across them but then have another group where you have to do
active/passive over, then we could do something new.

I think we have what we have now, because back when it was made we couldn't
tell dm-multipath what the error was, so we had to handle it in both layers
like this.

Now, we could fail immediately, not block the queues so IO would not get stuck
in there, and then dm-multipath and multipathd could do something smart with
the error since they know the reason it was failed, and the path and path group
layout, and have more info on the device like if it is active passive. For compat
and to support what both of us need, we could have a timeout/setting for paths
within a path group and one for switching path groups. We could also do different
behaviors of the error was DID_TRANSPORT_DISRUPTED vs DID_TRANSPORT_FAILFAST
since for the latter the transport timers have fired.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-01-28 18:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-26 13:02 [PATCH] scsi: do not retry FAILFAST commands on DID_TRANSPORT_DISRUPTED Hannes Reinecke
2021-01-26 13:21 ` Martin Wilck
2021-01-28  0:51 ` michael.christie
2021-01-28  6:57   ` Hannes Reinecke
2021-01-28 13:40     ` Ewan D. Milne
2021-01-28 18:01     ` Mike Christie
2021-01-28 18:21       ` Mike Christie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.