[QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is

All of lore.kernel.org
 help / color / mirror / Atom feed

* [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
@ 2022-11-28  3:58 Wenchao Hao
  2022-11-28 12:52 ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Wenchao Hao @ 2022-11-28  3:58 UTC (permalink / raw)
  To: Martin K. Petersen, James E.J. Bottomley, Mike Christie, James Bottomley
  Cc: linux-scsi, linfeilong, yanaijie, xuhujie, lijinlin3

static inline bool scsi_status_is_good(int status)                                                                                                                                                             
{
        if (status < 0)
                return false;

        if (host_byte(status) == DID_NO_CONNECT)
                return false;

        /*  
         * FIXME: bit0 is listed as reserved in SCSI-2, but is
         * significant in SCSI-3.  For now, we follow the SCSI-2
         * behaviour and ignore reserved bits.
         */
        status &= 0xfe;
        return ((status == SAM_STAT_GOOD) ||
                (status == SAM_STAT_CONDITION_MET) ||
                /* Next two "intermediate" statuses are obsolete in SAM-4 */
                (status == SAM_STAT_INTERMEDIATE) ||
                (status == SAM_STAT_INTERMEDIATE_CONDITION_MET) ||
                /* FIXME: this is obsolete in SAM-3 */
                (status == SAM_STAT_COMMAND_TERMINATED));
}
We have function defined in include/scsi/scsi.h as above, which is used to check
if the status in result is good.

But the function cleared the lowest bit of SCSI command's status, which would
translate an undefined status to good in some condition, for example the status
is 0x1.

This code seems introduced in this patch:
https://lore.kernel.org/all/1052403312.2097.35.camel@mulgrave/

Is anyone who knows why did we clear the lowest bit? 

We found some firmware or drivers would return status which did not defined in
SAM. Now SCSI middle level do not have an uniform behavior for these undefined
status. I want to change the logic to return error for all status which did not
defined in SAM or define a method in host template to let drivers to judge
what to do in this condition.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28  3:58 [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good Wenchao Hao
@ 2022-11-28 12:52 ` James Bottomley
  2022-11-28 14:41   ` Wenchao Hao
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2022-11-28 12:52 UTC (permalink / raw)
  To: Wenchao Hao, Martin K. Petersen, Mike Christie
  Cc: linux-scsi, linfeilong, yanaijie, xuhujie, lijinlin3

On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> static inline bool scsi_status_is_good(int
> status)                                                              
>                                                                      
>                           
> {
>         if (status < 0)
>                 return false;
> 
>         if (host_byte(status) == DID_NO_CONNECT)
>                 return false;
> 
>         /*  
>          * FIXME: bit0 is listed as reserved in SCSI-2, but is
>          * significant in SCSI-3.  For now, we follow the SCSI-2
>          * behaviour and ignore reserved bits.
>          */
>         status &= 0xfe;
>         return ((status == SAM_STAT_GOOD) ||
>                 (status == SAM_STAT_CONDITION_MET) ||
>                 /* Next two "intermediate" statuses are obsolete in
> SAM-4 */
>                 (status == SAM_STAT_INTERMEDIATE) ||
>                 (status == SAM_STAT_INTERMEDIATE_CONDITION_MET) ||
>                 /* FIXME: this is obsolete in SAM-3 */
>                 (status == SAM_STAT_COMMAND_TERMINATED));
> }
> We have function defined in include/scsi/scsi.h as above, which is
> used to check if the status in result is good.
> 
> But the function cleared the lowest bit of SCSI command's status,
> which would translate an undefined status to good in some condition,
> for example the status is 0x1.
> 
> This code seems introduced in this patch:
> https://lore.kernel.org/all/1052403312.2097.35.camel@mulgrave/
> 
> Is anyone who knows why did we clear the lowest bit? 

It says why in the comment you quote above ... what is unclear about
it?

> We found some firmware or drivers would return status which did not
> defined in SAM. Now SCSI middle level do not have an uniform behavior
> for these undefined status. I want to change the logic to return
> error for all status which did not defined in SAM or define a method
> in host template to let drivers to judge what to do in this
> condition.

Why? The general rule of thumb is be strict in what you emit and
generous in what you receive (which is why reserved bits are ignored).
Is the drive you refer to above not working in some way, in which case
detail it so people can understand the actual problem.

James


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28 12:52 ` James Bottomley
@ 2022-11-28 14:41   ` Wenchao Hao
  2022-11-28 15:21     ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Wenchao Hao @ 2022-11-28 14:41 UTC (permalink / raw)
  To: jejb, Martin K. Petersen, Mike Christie
  Cc: linux-scsi, linfeilong, yanaijie, xuhujie, lijinlin3

On 2022/11/28 20:52, James Bottomley wrote:
> On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
>> static inline bool scsi_status_is_good(int
>> status)                                                              
>>                                                                      
>>                           
>> {
>>         if (status < 0)
>>                 return false;
>>
>>         if (host_byte(status) == DID_NO_CONNECT)
>>                 return false;
>>
>>         /*  
>>          * FIXME: bit0 is listed as reserved in SCSI-2, but is
>>          * significant in SCSI-3.  For now, we follow the SCSI-2
>>          * behaviour and ignore reserved bits.
>>          */
>>         status &= 0xfe;
>>         return ((status == SAM_STAT_GOOD) ||
>>                 (status == SAM_STAT_CONDITION_MET) ||
>>                 /* Next two "intermediate" statuses are obsolete in
>> SAM-4 */
>>                 (status == SAM_STAT_INTERMEDIATE) ||
>>                 (status == SAM_STAT_INTERMEDIATE_CONDITION_MET) ||
>>                 /* FIXME: this is obsolete in SAM-3 */
>>                 (status == SAM_STAT_COMMAND_TERMINATED));
>> }
>> We have function defined in include/scsi/scsi.h as above, which is
>> used to check if the status in result is good.
>>
>> But the function cleared the lowest bit of SCSI command's status,
>> which would translate an undefined status to good in some condition,
>> for example the status is 0x1.
>>
>> This code seems introduced in this patch:
>> https://lore.kernel.org/all/1052403312.2097.35.camel@mulgrave/
>>
>> Is anyone who knows why did we clear the lowest bit? 
> 
> It says why in the comment you quote above ... what is unclear about
> it?
> 
>> We found some firmware or drivers would return status which did not
>> defined in SAM. Now SCSI middle level do not have an uniform behavior
>> for these undefined status. I want to change the logic to return
>> error for all status which did not defined in SAM or define a method
>> in host template to let drivers to judge what to do in this
>> condition.
> 
> Why? The general rule of thumb is be strict in what you emit and
> generous in what you receive (which is why reserved bits are ignored).
> Is the drive you refer to above not working in some way, in which case
> detail it so people can understand the actual problem.
> 
> James
> 
> .


We come with an issue with megaraid_sas driver. Where scsi_cmnd is completed with result's
status byte set to 1, scsi_io_completion() would finish this scsi_cmnd's request with BLK_STS_OK.
The path is:

scsi_io_completion()
	scsi_io_completion_nz_result()
		scsi_status_is_good()	->  return true
		result = 0;
		*blk_statp = BLK_STS_OK
	scsi_end_request() with BLK_STS_OK

While the scsi_cmnd result is actually wrong and should be finished with BLK_STS_IOERR.


What's more, according to code analysis, we found for scsi_cmnd which completed with status byte
undefined in SAM, the scsi_status_is_good() behave differently. For example, if status byte is in
0x1, 0x3, 0x5, 0x9 and so on, it would return true; while for status in other reserved field,
it would return false. 

So I think we should respect the SAM and drop the line "status &= 0xfe" in scsi_status_is_good()
to tell the caller the status is not good for all status which is not defined in SAM, so
scsi_result_to_blk_status() would return BLK_STS_IOERR on this condition.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28 14:41   ` Wenchao Hao
@ 2022-11-28 15:21     ` James Bottomley
  2022-11-28 17:38       ` Wenchao Hao
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2022-11-28 15:21 UTC (permalink / raw)
  To: Wenchao Hao, Martin K. Petersen, Mike Christie
  Cc: linux-scsi, linfeilong, yanaijie, xuhujie, lijinlin3

On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> On 2022/11/28 20:52, James Bottomley wrote:
> > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
[...]
> > > We found some firmware or drivers would return status which did
> > > not defined in SAM. Now SCSI middle level do not have an uniform
> > > behavior for these undefined status. I want to change the logic
> > > to return error for all status which did not defined in SAM or
> > > define a method in host template to let drivers to judge what to
> > > do in this condition.
> > 
> > Why? The general rule of thumb is be strict in what you emit and
> > generous in what you receive (which is why reserved bits are
> > ignored). Is the drive you refer to above not working in some way,
> > in which case detail it so people can understand the actual
> > problem.
> > 
> > James
> > 
> > .
> 
> 
> We come with an issue with megaraid_sas driver. Where scsi_cmnd is
> completed with result's status byte set to 1, 

Megaraid_sas is an emulation driver for the most part, so it sounds
like this is in the RAID emulation firmware, correct?  The driver can
correct for emulation failures, if you can figure out what it's trying
to signal and convert it to the correct SAM error code. There's no need
to change anything in the layers above.  If you can't figure out the
translation and you want the transfer to error, then add DID_ERROR,
which is a nice catch all.  If this is transient and could be fixed by
a retry, then do DID_SOFT_ERROR instead.

James

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28 15:21     ` James Bottomley
@ 2022-11-28 17:38       ` Wenchao Hao
  2022-11-28 18:12         ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Wenchao Hao @ 2022-11-28 17:38 UTC (permalink / raw)
  To: jejb
  Cc: Wenchao Hao, Martin K. Petersen, Mike Christie, linux-scsi,
	linfeilong, yanaijie, xuhujie, lijinlin3

On Mon, Nov 28, 2022 at 11:24 PM James Bottomley <jejb@linux.ibm.com> wrote:
>
> On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> > On 2022/11/28 20:52, James Bottomley wrote:
> > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> [...]
> > > > We found some firmware or drivers would return status which did
> > > > not defined in SAM. Now SCSI middle level do not have an uniform
> > > > behavior for these undefined status. I want to change the logic
> > > > to return error for all status which did not defined in SAM or
> > > > define a method in host template to let drivers to judge what to
> > > > do in this condition.
> > >
> > > Why? The general rule of thumb is be strict in what you emit and
> > > generous in what you receive (which is why reserved bits are
> > > ignored). Is the drive you refer to above not working in some way,
> > > in which case detail it so people can understand the actual
> > > problem.
> > >
> > > James
> > >
> > > .
> >
> >
> > We come with an issue with megaraid_sas driver. Where scsi_cmnd is
> > completed with result's status byte set to 1,
>
> Megaraid_sas is an emulation driver for the most part, so it sounds
> like this is in the RAID emulation firmware, correct?  The driver can
> correct for emulation failures, if you can figure out what it's trying
> to signal and convert it to the correct SAM error code. There's no need
> to change anything in the layers above.  If you can't figure out the
> translation and you want the transfer to error, then add DID_ERROR,
> which is a nice catch all.  If this is transient and could be fixed by
> a retry, then do DID_SOFT_ERROR instead.
>
> James
>

Thanks for your answer, Of curse we can recognize these undefined status
and map to an error which can be handled by SCSI middle level now. But I
still confused why shouldn't we change the scsi_status_is_good() to
respect to SAM?

I downloaded the sam6r08.pdf from t10.org, the status byte still did not
use the reserved bit. I think we should return an exact for all status
which is undefined in SAM rather than return true for some status
codes(0x1, 0x3, 0x5... and so on), but return false for other status.
I don't think there is any  difference between these undefined status.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28 17:38       ` Wenchao Hao
@ 2022-11-28 18:12         ` James Bottomley
  2022-12-02 13:58           ` Wenchao Hao
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2022-11-28 18:12 UTC (permalink / raw)
  To: Wenchao Hao
  Cc: Wenchao Hao, Martin K. Petersen, Mike Christie, linux-scsi,
	linfeilong, yanaijie, xuhujie, lijinlin3

On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote:
> On Mon, Nov 28, 2022 at 11:24 PM James Bottomley <jejb@linux.ibm.com>
> wrote:
> > 
> > On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> > > On 2022/11/28 20:52, James Bottomley wrote:
> > > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> > [...]
> > > > > We found some firmware or drivers would return status which
> > > > > did not defined in SAM. Now SCSI middle level do not have an
> > > > > uniform behavior for these undefined status. I want to change
> > > > > the logic to return error for all status which did not
> > > > > defined in SAM or define a method in host template to let
> > > > > drivers to judge what to do in this condition.
> > > > 
> > > > Why? The general rule of thumb is be strict in what you emit
> > > > and generous in what you receive (which is why reserved bits
> > > > are ignored). Is the drive you refer to above not working in
> > > > some way, in which case detail it so people can understand the
> > > > actual problem.
> > > > 
> > > > James
> > > > 
> > > > .
> > > 
> > > 
> > > We come with an issue with megaraid_sas driver. Where scsi_cmnd
> > > is completed with result's status byte set to 1,
> > 
> > Megaraid_sas is an emulation driver for the most part, so it sounds
> > like this is in the RAID emulation firmware, correct?  The driver
> > can correct for emulation failures, if you can figure out what it's
> > trying to signal and convert it to the correct SAM error code.
> > There's no need to change anything in the layers above.  If you
> > can't figure out the translation and you want the transfer to
> > error, then add DID_ERROR, which is a nice catch all.  If this is
> > transient and could be fixed by a retry, then do DID_SOFT_ERROR
> > instead.
> > 
> > James
> > 
> 
> Thanks for your answer, Of curse we can recognize these undefined
> status and map to an error which can be handled by SCSI middle level
> now. But I still confused why shouldn't we change the
> scsi_status_is_good() to respect to SAM?

Because it wouldn't be backwards compatible and something might break.
Under SCSI-1, devices were allowed to set this bit to signal vendor
unique status and a lot of manufacturers continued doing this for SCSI-
2, even though it was flagged as reserved instead of vendor specific in
that standard, hence the mask.  Since this was over 20 years ago, it is
possible there is no remaining functional device that does this, but if
it's not causing a problem, why take the risk?

James


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-11-28 18:12         ` James Bottomley
@ 2022-12-02 13:58           ` Wenchao Hao
  2022-12-02 14:17             ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Wenchao Hao @ 2022-12-02 13:58 UTC (permalink / raw)
  To: jejb, Wenchao Hao
  Cc: Martin K. Petersen, Mike Christie, linux-scsi, linfeilong,
	yanaijie, xuhujie, lijinlin3


On 2022/11/29 2:12, James Bottomley wrote:
> On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote:
>> On Mon, Nov 28, 2022 at 11:24 PM James Bottomley <jejb@linux.ibm.com>
>> wrote:
>>>
>>> On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
>>>> On 2022/11/28 20:52, James Bottomley wrote:
>>>>> On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
>>> [...]
>>>>>> We found some firmware or drivers would return status which
>>>>>> did not defined in SAM. Now SCSI middle level do not have an
>>>>>> uniform behavior for these undefined status. I want to change
>>>>>> the logic to return error for all status which did not
>>>>>> defined in SAM or define a method in host template to let
>>>>>> drivers to judge what to do in this condition.
>>>>>
>>>>> Why? The general rule of thumb is be strict in what you emit
>>>>> and generous in what you receive (which is why reserved bits
>>>>> are ignored). Is the drive you refer to above not working in
>>>>> some way, in which case detail it so people can understand the
>>>>> actual problem.
>>>>>
>>>>> James
>>>>>
>>>>> .
>>>>
>>>>
>>>> We come with an issue with megaraid_sas driver. Where scsi_cmnd
>>>> is completed with result's status byte set to 1,
>>>
>>> Megaraid_sas is an emulation driver for the most part, so it sounds
>>> like this is in the RAID emulation firmware, correct?  The driver
>>> can correct for emulation failures, if you can figure out what it's
>>> trying to signal and convert it to the correct SAM error code.
>>> There's no need to change anything in the layers above.  If you
>>> can't figure out the translation and you want the transfer to
>>> error, then add DID_ERROR, which is a nice catch all.  If this is
>>> transient and could be fixed by a retry, then do DID_SOFT_ERROR
>>> instead.
>>>
>>> James
>>>
>>
>> Thanks for your answer, Of curse we can recognize these undefined
>> status and map to an error which can be handled by SCSI middle level
>> now. But I still confused why shouldn't we change the
>> scsi_status_is_good() to respect to SAM?
> 
> Because it wouldn't be backwards compatible and something might break.
> Under SCSI-1, devices were allowed to set this bit to signal vendor
> unique status and a lot of manufacturers continued doing this for SCSI-
> 2, even though it was flagged as reserved instead of vendor specific in
> that standard, hence the mask.  Since this was over 20 years ago, it is
> possible there is no remaining functional device that does this, but if
> it's not causing a problem, why take the risk?
> 
> James
> 
> .

Hi James, thank you very much for your answer.

I think we should think about the following functions of megaraid driver:

megasas_complete_cmd() defined in drivers/scsi/megaraid/megaraid_sas_base.c,
megasas_complete_cmd
	...
	switch (hdr->cmd) {
	...
	case MFI_CMD_LD_READ:
	case MFI_CMD_LD_WRITE:
		switch (hdr->cmd_status) {
		case MFI_STAT_SCSI_DONE_WITH_ERROR:
			cmd->scmd->result = (DID_OK << 16) | hdr->scsi_status;
			break;
		...
		}
	...
	}

map_cmd_status() defined in drivers/scsi/megaraid/megaraid_sas_fusion.c 
map_cmd_status
	...
	switch (status) {
	...
	case MFI_STAT_SCSI_DONE_WITH_ERROR:
		scmd->result = (DID_OK << 16) | ext_status;
		break;
	...
	}

Both of these functions did not check the status byte, which can not make
sure the status byte is defined in SAM.

What we meet is the status byte set to 1, and the host_byte is set to DID_OK.

In this condition, the scsi_cmnd would be finished by scsi middle layer with BLK_STS_OK
if the kernel version is before 3d45cefc8edd7 (scsi: core: Drop obsolete Linux-specific
SCSI status codes).

Because scsi_io_completion_nz_result() would return a non zero value, so we call
scsi_io_completion_action() to handle this command.
While in scsi_io_completion_action(), the blk_stat mapped by scsi_result_to_blk_status()
is BLK_STS_OK which finally result in the command finished with BLK_STS_OK.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-12-02 13:58           ` Wenchao Hao
@ 2022-12-02 14:17             ` James Bottomley
  2022-12-02 16:26               ` Wenchao Hao
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2022-12-02 14:17 UTC (permalink / raw)
  To: Wenchao Hao, Wenchao Hao
  Cc: Martin K. Petersen, Mike Christie, linux-scsi, linfeilong,
	yanaijie, xuhujie, lijinlin3

On Fri, 2022-12-02 at 21:58 +0800, Wenchao Hao wrote:
> 
> On 2022/11/29 2:12, James Bottomley wrote:
> > On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote:
> > > On Mon, Nov 28, 2022 at 11:24 PM James Bottomley
> > > <jejb@linux.ibm.com>
> > > wrote:
> > > > 
> > > > On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> > > > > On 2022/11/28 20:52, James Bottomley wrote:
> > > > > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> > > > [...]
> > > > > > > We found some firmware or drivers would return status
> > > > > > > which
> > > > > > > did not defined in SAM. Now SCSI middle level do not have
> > > > > > > an
> > > > > > > uniform behavior for these undefined status. I want to
> > > > > > > change
> > > > > > > the logic to return error for all status which did not
> > > > > > > defined in SAM or define a method in host template to let
> > > > > > > drivers to judge what to do in this condition.
> > > > > > 
> > > > > > Why? The general rule of thumb is be strict in what you
> > > > > > emit
> > > > > > and generous in what you receive (which is why reserved
> > > > > > bits
> > > > > > are ignored). Is the drive you refer to above not working
> > > > > > in
> > > > > > some way, in which case detail it so people can understand
> > > > > > the
> > > > > > actual problem.
> > > > > > 
> > > > > > James
> > > > > > 
> > > > > > .
> > > > > 
> > > > > 
> > > > > We come with an issue with megaraid_sas driver. Where
> > > > > scsi_cmnd
> > > > > is completed with result's status byte set to 1,
> > > > 
> > > > Megaraid_sas is an emulation driver for the most part, so it
> > > > sounds
> > > > like this is in the RAID emulation firmware, correct?  The
> > > > driver
> > > > can correct for emulation failures, if you can figure out what
> > > > it's
> > > > trying to signal and convert it to the correct SAM error code.
> > > > There's no need to change anything in the layers above.  If you
> > > > can't figure out the translation and you want the transfer to
> > > > error, then add DID_ERROR, which is a nice catch all.  If this
> > > > is
> > > > transient and could be fixed by a retry, then do DID_SOFT_ERROR
> > > > instead.
> > > > 
> > > > James
> > > > 
> > > 
> > > Thanks for your answer, Of curse we can recognize these undefined
> > > status and map to an error which can be handled by SCSI middle
> > > level
> > > now. But I still confused why shouldn't we change the
> > > scsi_status_is_good() to respect to SAM?
> > 
> > Because it wouldn't be backwards compatible and something might
> > break.
> > Under SCSI-1, devices were allowed to set this bit to signal vendor
> > unique status and a lot of manufacturers continued doing this for
> > SCSI-
> > 2, even though it was flagged as reserved instead of vendor
> > specific in
> > that standard, hence the mask.  Since this was over 20 years ago,
> > it is
> > possible there is no remaining functional device that does this,
> > but if
> > it's not causing a problem, why take the risk?
> > 
> > James
> > 
> > .
> 
> Hi James, thank you very much for your answer.
> 
> I think we should think about the following functions of megaraid
> driver:
> 
> megasas_complete_cmd() defined in
> drivers/scsi/megaraid/megaraid_sas_base.c,
> megasas_complete_cmd
>         ...
>         switch (hdr->cmd) {
>         ...
>         case MFI_CMD_LD_READ:
>         case MFI_CMD_LD_WRITE:
>                 switch (hdr->cmd_status) {
>                 case MFI_STAT_SCSI_DONE_WITH_ERROR:
>                         cmd->scmd->result = (DID_OK << 16) | hdr-
> >scsi_status;
>                         break;
>                 ...
>                 }
>         ...
>         }
> 
> map_cmd_status() defined in
> drivers/scsi/megaraid/megaraid_sas_fusion.c 
> map_cmd_status
>         ...
>         switch (status) {
>         ...
>         case MFI_STAT_SCSI_DONE_WITH_ERROR:
>                 scmd->result = (DID_OK << 16) | ext_status;
>                 break;
>         ...
>         }
> 
> Both of these functions did not check the status byte, which can not
> make sure the status byte is defined in SAM.

Right, but the first one should be returning actual status from the
drive, so should be OK.  The second one looks to be returning
manufactured raid status, which is likely the problem.

in either case, just fix the code to return DID_ERROR<<16 if the status
is non SAM conforming.

> What we meet is the status byte set to 1, and the host_byte is set to
> DID_OK.
> 
> In this condition, the scsi_cmnd would be finished by scsi middle
> layer with BLK_STS_OK if the kernel version is before 3d45cefc8edd7
> (scsi: core: Drop obsolete Linux-specific SCSI status codes).

I don't believe it does: that commit should produce identical code
before and after; it merely replaced the shifted status conditions with
the unshifted ones.

> Because scsi_io_completion_nz_result() would return a non zero value,
> so we call scsi_io_completion_action() to handle this command.
> While in scsi_io_completion_action(), the blk_stat mapped by
> scsi_result_to_blk_status()
> is BLK_STS_OK which finally result in the command finished with
> BLK_STS_OK.


So as I said previously, get the driver to return DID_ERROR<<16 for the
bogus SAM conditions.

James


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good
  2022-12-02 14:17             ` James Bottomley
@ 2022-12-02 16:26               ` Wenchao Hao
  0 siblings, 0 replies; 9+ messages in thread
From: Wenchao Hao @ 2022-12-02 16:26 UTC (permalink / raw)
  To: jejb
  Cc: Wenchao Hao, Martin K. Petersen, Mike Christie, linux-scsi,
	linfeilong, yanaijie, xuhujie, lijinlin3

On Fri, Dec 2, 2022 at 10:17 PM James Bottomley <jejb@linux.ibm.com> wrote:
>
> On Fri, 2022-12-02 at 21:58 +0800, Wenchao Hao wrote:
> >
> > On 2022/11/29 2:12, James Bottomley wrote:
> > > On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote:
> > > > On Mon, Nov 28, 2022 at 11:24 PM James Bottomley
> > > > <jejb@linux.ibm.com>
> > > > wrote:
> > > > >
> > > > > On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> > > > > > On 2022/11/28 20:52, James Bottomley wrote:
> > > > > > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> > > > > [...]
> > > > > > > > We found some firmware or drivers would return status
> > > > > > > > which
> > > > > > > > did not defined in SAM. Now SCSI middle level do not have
> > > > > > > > an
> > > > > > > > uniform behavior for these undefined status. I want to
> > > > > > > > change
> > > > > > > > the logic to return error for all status which did not
> > > > > > > > defined in SAM or define a method in host template to let
> > > > > > > > drivers to judge what to do in this condition.
> > > > > > >
> > > > > > > Why? The general rule of thumb is be strict in what you
> > > > > > > emit
> > > > > > > and generous in what you receive (which is why reserved
> > > > > > > bits
> > > > > > > are ignored). Is the drive you refer to above not working
> > > > > > > in
> > > > > > > some way, in which case detail it so people can understand
> > > > > > > the
> > > > > > > actual problem.
> > > > > > >
> > > > > > > James
> > > > > > >
> > > > > > > .
> > > > > >
> > > > > >
> > > > > > We come with an issue with megaraid_sas driver. Where
> > > > > > scsi_cmnd
> > > > > > is completed with result's status byte set to 1,
> > > > >
> > > > > Megaraid_sas is an emulation driver for the most part, so it
> > > > > sounds
> > > > > like this is in the RAID emulation firmware, correct?  The
> > > > > driver
> > > > > can correct for emulation failures, if you can figure out what
> > > > > it's
> > > > > trying to signal and convert it to the correct SAM error code.
> > > > > There's no need to change anything in the layers above.  If you
> > > > > can't figure out the translation and you want the transfer to
> > > > > error, then add DID_ERROR, which is a nice catch all.  If this
> > > > > is
> > > > > transient and could be fixed by a retry, then do DID_SOFT_ERROR
> > > > > instead.
> > > > >
> > > > > James
> > > > >
> > > >
> > > > Thanks for your answer, Of curse we can recognize these undefined
> > > > status and map to an error which can be handled by SCSI middle
> > > > level
> > > > now. But I still confused why shouldn't we change the
> > > > scsi_status_is_good() to respect to SAM?
> > >
> > > Because it wouldn't be backwards compatible and something might
> > > break.
> > > Under SCSI-1, devices were allowed to set this bit to signal vendor
> > > unique status and a lot of manufacturers continued doing this for
> > > SCSI-
> > > 2, even though it was flagged as reserved instead of vendor
> > > specific in
> > > that standard, hence the mask.  Since this was over 20 years ago,
> > > it is
> > > possible there is no remaining functional device that does this,
> > > but if
> > > it's not causing a problem, why take the risk?
> > >
> > > James
> > >
> > > .
> >
> > Hi James, thank you very much for your answer.
> >
> > I think we should think about the following functions of megaraid
> > driver:
> >
> > megasas_complete_cmd() defined in
> > drivers/scsi/megaraid/megaraid_sas_base.c,
> > megasas_complete_cmd
> >         ...
> >         switch (hdr->cmd) {
> >         ...
> >         case MFI_CMD_LD_READ:
> >         case MFI_CMD_LD_WRITE:
> >                 switch (hdr->cmd_status) {
> >                 case MFI_STAT_SCSI_DONE_WITH_ERROR:
> >                         cmd->scmd->result = (DID_OK << 16) | hdr-
> > >scsi_status;
> >                         break;
> >                 ...
> >                 }
> >         ...
> >         }
> >
> > map_cmd_status() defined in
> > drivers/scsi/megaraid/megaraid_sas_fusion.c
> > map_cmd_status
> >         ...
> >         switch (status) {
> >         ...
> >         case MFI_STAT_SCSI_DONE_WITH_ERROR:
> >                 scmd->result = (DID_OK << 16) | ext_status;
> >                 break;
> >         ...
> >         }
> >
> > Both of these functions did not check the status byte, which can not
> > make sure the status byte is defined in SAM.
>
> Right, but the first one should be returning actual status from the
> drive, so should be OK.  The second one looks to be returning
> manufactured raid status, which is likely the problem.
>
> in either case, just fix the code to return DID_ERROR<<16 if the status
> is non SAM conforming.
>
> > What we meet is the status byte set to 1, and the host_byte is set to
> > DID_OK.
> >
> > In this condition, the scsi_cmnd would be finished by scsi middle
> > layer with BLK_STS_OK if the kernel version is before 3d45cefc8edd7
> > (scsi: core: Drop obsolete Linux-specific SCSI status codes).
>
> I don't believe it does: that commit should produce identical code
> before and after; it merely replaced the shifted status conditions with
> the unshifted ones.
>
Here is how the commit affect the flow:

When the command finished with status set to 1, and the actual
finished byte is less than requested.

Before this commit, scsi_io_completion_nz_result would check
result like following:
if (status_byte(result) && scsi_status_is_good(result)) {
        result = 0;
        *blk_statp = BLK_STS_OK;
}
If the lowest byte of result is 1, the status_byte() returned false,
the result would not be updated. So scsi_io_completion_nz_result()
returned a non zero value. We would call scsi_io_completion_action()
to handle this command after scsi_end_request() failed.

In scsi_io_completion_action(), the command should be finished
with BLK_STS_IOERR, while it is finished with BLK_STS_OK.

After this commit, scsi_io_completion_nz_result would check
result like following:
if (result & 0xff && scsi_status_is_good(result)) {
        result = 0;
        *blk_statp = BLK_STS_OK;
}
Where result && 0xff is 1, the result would be 0, and
scsi_io_completion_nz_result() returned 0. We would call
scsi_mq_requeue_cmd() to requeue this command after
scsi_end_request() failed.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-12-02 16:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-28  3:58 [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good Wenchao Hao
2022-11-28 12:52 ` James Bottomley
2022-11-28 14:41   ` Wenchao Hao
2022-11-28 15:21     ` James Bottomley
2022-11-28 17:38       ` Wenchao Hao
2022-11-28 18:12         ` James Bottomley
2022-12-02 13:58           ` Wenchao Hao
2022-12-02 14:17             ` James Bottomley
2022-12-02 16:26               ` Wenchao Hao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.